Search Results: "Keith Packard"

25 August 2013

Joey Hess: idea: git push requests

This is an idea that Keith Packard told me. It's a brilliant way to reduce GitHub's growing lockin, but I don't know how to implement it. And I almost forgot about it, until I had another annoying "how do I send you a patch with this amazing git technology?" experience and woke up with my memory refreshed. The idea is to allow anyone to git push to any anonymous git:// repository. But the objects pushed are not stored in a public part of the repository (which could be abused). Instead the receiving repository emails them off to the repository owner, in a git-am-able format. So this is like a github pull request except it can be made on any git repository, and you don't have to go look up the obfuscated contact email address and jump through git-format-patch hoops to make it. You just commit changes to your local repository, and git push to wherever you cloned from in the first place. If the push succeeds, you know your patch is on its way for review. Keith may have also wanted to store the objects in the repository in some way that a simple git command run there could apply them without the git-am bother on the receiving end. I forget. I think git-am would be good enough -- and including the actual diffs in the email would actually make this far superior to github pull request emails, which are maximally annoying by not doing so.

Hmm, I said I didn't know how to implement this, but I do know one way. Make the git-daemon run an arbitrary script when receiving a push request. A daemon.pushscript config setting could enable this. The script could be something like this:

#!/bin/sh
set -e
tmprepo="$(mktemp -d)"
# this shared clone is *really* fast even for huge repositories, and uses
# only a few 100 kb of disk space!
git clone --shared --bare "$GIT_DIR" "$tmprepo"
git-receive-pack "$tmprepo"
# XXX add email sending code here.
rm -rf "$tmprepo"

Of course, this functionality could be built into the git-daemon too. I suspect a script hook and an example script in contrib/ might be an easier patch to get accepted into git though. That may be as far as I take this idea, at least for now..

7 August 2013

Keith Packard: embedded cpus

Choosing Embedded Processors for AltOS When Bdale and I started building rocketry hardware together, we had already picked out a target processor, the TI cc1111. We picked that chip almost entirely based on the digital transceiver that is built in to the chip, and not because we had any particular love for the 8051 microcontroller. At that time, I d seen people struggle with PIC processors, battle AVR to a draw and spend a lot of time trying to get various ARM processors running. So, the 8051 didn t seem all that far from normal, and the cc1111 implementation of it is pretty reasonable, including credible USB support and a built-in DMA engine. Since those early days, we ve gone on to build boards with a slightly wider range of processors:

TI cc1111
Atmel ATmega32U4
Atmel ATiny85
STMicroelectronics STM32L151/STM32L152
NXP LPC11U14

Bdale thinks we should reduce the number of components we use to focus our efforts better. He s probably right, but I have to admit that I ve had way too much fun getting each of these chips running. I thought I d spend a bit of time describing our general process for selecting a new CPU.

Free software. I m long past the age at which I trust hardware vendors to deliver credible binary software. If you want me to use your silicon, make sure the tools required to use it are all free software.
Debian support required. Bdale and I do everything on Debian, so any processor we use needs to have tools that run under Debian. We re more than willing to package software for Debian, so if it s just generic Linux-compatible free software, that s fine.
Built-in USB. Everything we build needs to talk to a computer for configuration and data download. Don t even think of asking me to attach another chip for USB support. Ok, so the ATtiny85 doesn t qualify here, and that project does use an external chip for USB which is connected to the ATtiny85 through an LED and phototransistor. I was excited when TI came out with the CC340 chips because the MSP430 is a much nicer CPU than the 8051. But, they still haven t added USB. If I m going to add another chip for USB, it ll be a CPU anyways, at which point I ll use a stand-alone RF chip.
Cheap developer boards. TI sells a tiny little cc1111 developer board; STmicroelectronics and NXP both sell boards that include a programmer and a target CPU for prototyping software. And, there are a million Arduino clones with various Atmel processors on them. Cheap means I can buy them without thinking too hard; having a built-in programmer means that early OS development can be done without first having to figure out how to wire up a programmer.
Probably 32 bits, probably an ARM. Yeah, the cc1111 and Atmel processors were fun to bring up, but I m pretty much over the 8-bit microcontroller now. ARM processors are far more sophisticated in terms of power management and built-in device capability, plus they re a ton faster at doing any math. The ATtiny85 remains an exception here; it s smaller than any ARM I can get, and so it s still useful for projects too small for an ARM.
Available in small quantities from a reasonable on-line retailer like Digikey, Mouser or even Avnet Express. I d love to try out the new low-cost STM32L100 parts, but the only one available is the largest one in the 64LQFP package, and I really don t need that many pins for most products.

CC1111 involved a lot of software hacking The 8051 processor in the CC1111 is very well documented, including the two-wire debugging interface. What was missing was a canned solution for programming and debugging the chip from Debian. However, with sufficient motivation and accurate docs, I was able to create a programmer from a USB device with a couple of GPIOs and a lot of ugly software on Linux. That sufficed to get the USB stack limping along, at which point I wrote a much faster programmer that ran on the cc1111 itself and could program another cc1111. With that, I created an assembly-level debugger that could be hooked to the existing SDCC source level debugger and had a full source-level debugging environment for the 8051. This turned out to be way better than what I could get on the Atmel processors, where I ve got a program loader and a whole lot of printf debugging. STM has been great The STM32L-Discovery board has a standard STM debugging setup for the Cortex SWD interface right on the same board as a target CPU. That made for completely self-contained development, with no jumper wires (a first for me, for sure). There s free software, stlink which can talk over the debugger USB connection to drive the SWD interface. This is sufficient to flash and debug applications using GDB. Of course, GCC supports ARM quite well; the only hard part was figuring out what to do for a C library. I settled on pdclib, which is at least easy to understand, if not highly optimized for ARM. We ve built a ton of boards with the STM32L151 and STM32L152; at this point I m pretty darn comfortable with the architecture and our tool chain. Adventures with NXP The NXP LPC11U14 is a very different beast. I m using this because:

It s small (5mm x 5mm)
It s cheap ($1.48 q100)

The LPCXpresso board looks much like the STM32L-Discovery, with a debugger interface wired to the CPU directly on the board. However, I haven t found free software tools to drive this programmer; all I ve found are binary-only tools from NXP. No thanks. Fortunately, the LPC11U14 uses exactly the same SWD interface as the STM32L, so I was able to sever the link between the programmer and the target on the LPCXpresso board and hook the target to an ST-Link device (either the one on the STM32L-Discovery board, or the new stand-along programming dongle I bought). With that, I wrote an openocd script to talk to the LPC11U14 and was in business. What I found in the NXP processor was a bit disturbing though there s a mask ROM that contains a boot loader, which always runs when the chip starts, and a bunch of utility code, including the only documented interface to programming the flash memory. I cannot fathom why anyone thought this was a good idea I don t want a BIOS in my embedded CPU, thankyouverymuch; I d really like my code to be the first instructions executed by the CPU. And, any embedded developer is more than capable of programming flash using a register specification, and calling some random embedded code in ROM from the middle of my operating system is more than a bit scary. NXP could do two simple things to make me like their parts a whole lot more:

Publish the programmer source code. The programming dongle is a tool for selling your silicon, not a separate revenue stream. Publish the source under a free software license (GPL recommended, of course). We ll take it, make it better, and ensure that you get the changes back to help your other customers. It s a win-win plan.
Document the flash programming registers. Yeah, I could probably disassemble the mask-rom bits that are on the chip, but I have better things to do with my time. Heck, go wild and document any other hidden bits in the silicon. Again, we re embedded developers, we don t like magic hardware hidden from view any more than PC developers like the BIOS and SMM.

Right now, I m hoping the STM32L100C6 parts become available in small quantities so I can try them out; they promise to be as cheap as the LPC11U14, but are better supported by free software and offer more complete hardware documentation. Yeah, they re a bit larger; that will probably be annoying.

Keith Packard: Cursor tracking

Tracking Cursor Position I spent yesterday morning in the Accessibility BOF here at Guadec and was reminded that one persistent problem with tools like screen magnifiers and screen readers is that they need to know the current cursor position all the time, independent of which window the cursor is in and independent of grabs. The current method that these applications are using to track the cursor is to poll the X server using XQueryPointer. This is obviously terrible for at least a couple of reasons:

Keeps the system active at regular intervals, preventing power savings.
Increased latency in mouse tracking the interval between polling calls limits the time resolution of the position information.

These two problems also conflict with one another. Reducing input latency comes at the cost of further reducing the opportunities for power saving, and vice versa. XInput2 to the rescue (?)

XInput2 has the ability to deliver raw device events right to applications, bypassing the whole event selection mechanism within the X server. This was designed to let games and other applications see relative mouse motion events and drawing applications see the whole tablet surface. These raw events are really raw though; they do not include the cursor position, and so cannot be directly used for tracking. However, we do know that the cursor only moves in response to input device events, so we can easily use the arrival of a raw event to trigger a query for the mouse position. A better plan? Perhaps what we should do is to actually create a new event type to report the cursor position and the containing window so that applications can simply track that. Yeah, it s a bit of a special case, but it s a common requirement for accessibility tools.

 
    CursorEvent
        EVENTHEADER
        detail:                    CARD32
        sourceid:                  DEVICEID
        flags:                     DEVICEEVENTFLAGS
    root:                      WINDOW
    window:                    WINDOW
    root-x, root-y:            INT16
    window-x, window-y:        INT16

A CursorEvent is sent whenever a sprite moves on the screen. sourceid is the master pointer which is moving. root is the root window containing the cursor, window is the window that the pointer is in. root-x and root-y indicate the position within the root window, window-x and window-y indicate the position within window . Demo Application Here s a short application, hacked from Peter Hutterer s part1.c

/* cc -o track_cursor track_cursor.c  pkg-config --cflags --libs xi x11  */
#include <stdio.h>
#include <string.h>
#include <X11/Xlib.h>
#include <X11/extensions/XInput2.h>
/* Return 1 if XI2 is available, 0 otherwise */
static int has_xi2(Display *dpy)
 
    int major, minor;
    int rc;
    /* We support XI 2.2 */
    major = 2;
    minor = 2;
    rc = XIQueryVersion(dpy, &major, &minor);
    if (rc == BadRequest)  
    printf("No XI2 support. Server supports version %d.%d only.\n", major, minor);
    return 0;
      else if (rc != Success)  
    fprintf(stderr, "Internal Error! This is a bug in Xlib.\n");
     
    printf("XI2 supported. Server provides version %d.%d.\n", major, minor);
    return 1;
 
static void select_events(Display *dpy, Window win)
 
    XIEventMask evmasks[1];
    unsigned char mask1[(XI_LASTEVENT + 7)/8];
    memset(mask1, 0, sizeof(mask1));
    /* select for button and key events from all master devices */
    XISetMask(mask1, XI_RawMotion);
    evmasks[0].deviceid = XIAllMasterDevices;
    evmasks[0].mask_len = sizeof(mask1);
    evmasks[0].mask = mask1;
    XISelectEvents(dpy, win, evmasks, 1);
    XFlush(dpy);
 
int main (int argc, char **argv)
 
    Display *dpy;
    int xi_opcode, event, error;
    XEvent ev;
    dpy = XOpenDisplay(NULL);
    if (!dpy)  
    fprintf(stderr, "Failed to open display.\n");
    return -1;
     
    if (!XQueryExtension(dpy, "XInputExtension", &xi_opcode, &event, &error))  
       printf("X Input extension not available.\n");
          return -1;
     
    if (!has_xi2(dpy))
    return -1;
    /* select for XI2 events */
    select_events(dpy, DefaultRootWindow(dpy));
    while(1)  
    XGenericEventCookie *cookie = &ev.xcookie;
    XIRawEvent      *re;
    Window          root_ret, child_ret;
    int         root_x, root_y;
    int         win_x, win_y;
    unsigned int        mask;
    XNextEvent(dpy, &ev);
    if (cookie->type != GenericEvent  
        cookie->extension != xi_opcode  
        !XGetEventData(dpy, cookie))
        continue;
    switch (cookie->evtype)  
    case XI_RawMotion:
        re = (XIRawEvent *) cookie->data;
        XQueryPointer(dpy, DefaultRootWindow(dpy),
                  &root_ret, &child_ret, &root_x, &root_y, &win_x, &win_y, &mask);
        printf ("raw %g,%g root %d,%d\n",
            re->raw_values[0], re->raw_values[1],
            root_x, root_y);
        break;
     
    XFreeEventData(dpy, cookie);
     
    return 0;

Hacks in xeyes Of course, one common mouse tracking application is xeyes, so I ve hacked up that code (on top of my present changes) here:

git clone git://people.freedesktop.org/~keithp/xeyes.git

24 July 2013

Keith Packard: present-redirect

Present Extension Redirection Multi-buffered applications have always behaved poorly in the presence of the Composite extension:

Updating involves multiple copies of the window contents, first from back buffer to composite redirect buffer, thence from the composite buffer to the screen back buffer, and then from the screen back buffer to the scanout buffer. The last copy is amenable to page flipping, but that would require the EGL buffer age extension so that the compositor could reduce the cost of presenting following frames by being able to track the contents of the back buffers.
The application is not informed about when the actual screen presentation occurs.

Owen Taylor suggested that Present should offer a way to redirect operations to the compositing manager as a way to solve these problems. This posting is my attempt to make that idea a bit more concrete given the current Present design. Design Goals Here s a list of features I think we should try to provide:

Provide accurate information to applications about when presentation to the screen actually occurs. In particular, GLX applications using GLX_OML_sync_control should receive the correct information in terms of UST and MSC for each Swap Buffers request.
Ensure that applications still receive correct information as to the contents of their buffers, in particular we want to be able to implement EGL_EXT_buffer_age in a useful manner.
Avoid needing to un-redirect full-screen windows to get page flipping behavior.
Eliminate all extra copies. A windowed application may still perform one copy from back buffer to scanout buffer, but there shouldn t be any reason to copy contents to the composite redirection buffer or the compositing manager back buffer.

Simple Present Redirection With those goals in mind, here s what I see as the sequence of events for a simple windowed application doing a new full-window update without any translucency or window transformation in effect:

Application creates back buffer, draws new frame to it.
Application executes PresentRegion. In this example, the valid and update parameters are None , indicating that the full window should be redrawn.
The server captures the PresentRegion request and constructs a PresentRedirectNotify event containing sufficient information for the compositor to place that image correctly on the screen:
- target window for the presentation
- source pixmap containing the new image
- idle fence to notify when the source pixmap is no longer in use.
- serial number from the request.
- target MSC for the presentation. This should probably just be the computed absolute MSC value, and not the original target/divisor/numerator values provided by the application.
The compositing manager receives this event and constructs a new PresentRegion request using the provided source pixmap, but this time targeting the root window, and constructing a valid region which clips the pixmap to the shape of the window on the screen. This request would use the original application s idle fence value so that when complete, the application would get notified. This request would need to also include the original target window and serial number so that a suitable PresentCompleteNotify event can be constructed and delivered when the final presentation is complete.
The server executes this new PresentRegion operation. When complete, it delivers PresentCompleteNotify events to both the compositing manager and the application.
Once the source pixmap is no longer in use (either the copy has completed, or the screen has flipped away from this pixmap), the server triggers the idle fence.

Multiple Application Redirection If multiple applications perform PresentRegion operations within the same frame, then the compositing manager will receive multiple PresentRedirectNotify events, and can simply construct multiple new PresentRegion requests. If these are all queued to the same global MSC, they will execute at the same frame boundary. No inter-operation dependency exists here. Complex Presentations Ok, so the simple case looks like it s pretty darn simple to implement, and satisfies the design goals nicely. Let s look at a couple of more complicated cases in common usage; the first is with translucency, the second with scaling application images down to thumbnails and the third with partial application window updates. Redirection with Translucency If the compositing manager discovers that a portion of the updated region overlays or is overlaid by something translucent (either another window, or drop shadows, or something else), then a composite image for that area must be constructed before the whole can be presented. Starting when the compositing manager receives the event, we have:

The compositing manager receives this event. Using the new pixmap, along with pixmaps for the other involved windows and graphical elements, the compositing manager constructs an updated view for the affected portion of the screen in a back buffer pixmap. Once complete, a PresentRegion operation that uses this back buffer pixmap is sent to the X server. Again, the original target window and serial number are also delivered to the server so that a suitable PresentCompleteNotify event can be delivered to the application.
The server executes this new PresentRegion operation; PresentCompleteNotify events are delivered, and idle fences triggered as appropriate.

Redirection with Transformation Transformation of the window contents means that we cannot always update a portion of the back buffer directly from the provided application pixmap as that will not contain the window border. Contents generated from a region that includes both application pixels and window border pixels must be sourced from a single pixmap containing both sets of pixels. One option that I ve discussed in the past to solve this would be to have the original application allocate the pixmap large enough to hold both the application contents and the surrounding window border. Have it draw the application contents at the correct offset within this pixmap, and then have the window manager contents drawn around that; either automatically by the X server, or even manually by the compositing manager. That would be mighty convenient for the compositing manager, but would require significant additional infrastructure throughout the X server and even harder the drawing system (OpenGL or some other system). There s another reason to want this though, and that s for sub-frame buffer scanout page swapping. The second option would be for the compositing manager to combine these images itself; there s a nice pixmap already containing the window manager image the composite redirect buffer. Taking the provided source pixmap and copying it directly to the target window will construct our composite image, just as if we had no Present redirection in place. This will cost an additional copy though, which we ve promised to avoid. Of course, as it s just for thumb-nailing or other visual effects, perhaps the compositing manager could perform this operation at a reduced frame rate, so that overall system performance didn t suffer. Retaining Access to the Application Buffer Above, I discussed having the idle fence from the redirected PresentRegion operation be sent along with the replacement PresentRegion operation. This ignores the fact that the composting manager may well need the contents of that application frame again in the future, when displaying changes for other applications that involve the same region of the screen. With the goal of making sure the idle fences are triggered as soon as possible so that applications can re-use or free the buffers quickly, let s think about when the triggering can occur.

Full-screen flipped applications. In this case, the application s idle fence can be triggered once the application provides a new frame and the X server has flipped to that new frame, or some other scanout buffer.
Windowed, copied applications. In this case, the application s idle fence can be triggered once the application provides a new frame to the compositing manager, and the X server doesn t have any presentations queued.

In both cases, we require that both the X server and the compositing manager be finished with the buffer before the application s idle fence should be triggered. One easy way to get this behavior is for the composting manager to create a new idle fence for its operations. When that is triggered, it would receive an X event and then trigger the applications idle fence as appropriate. This would add considerable latency to the application s idle fence a round trip through the compositing manager. The alternative would be to construct some additional protocol to make the applications idle fence dependent on the Present operation and some additional state provided by the compositing manager. Some experimentation here is warranted, but my experience with latency in this area is that it causes applications to end up allocating another back buffer as the idle notification arrives just after a buffer allocation request comes down to the rendering library. Definitely sub-optimal. An Aside on Media Stream Counters The GLX_OML_sync_control extension defines the Media Stream Counter (MSC) as a counter unique to the graphics subsystem which is incremented each time a vertical retrace occurs. That would be trivial if we had only one vertical retrace sequence in the world. However, we actually have N+1 such counters, one for each of the N active monitors in the system and a separate fake counter to be used when none of the other counters is available. In the current Present implementation, windows transition among the various Media Stream Counter domains as they move between the various monitors, and those monitors get turned on and off. As they move between these counter domains, Present tracks a global offset from their original domain. This offset ensures that the MSC value remains monotonically increasing as seen by each window. What it does not ensure is that all windows have comparable MSC sequence values; two windows on the same monitor may well have different MSC values for the same physical retrace event. And, even moving a window from one MSC domain to another and back won t make it return to the original MSC sequence values due to differences in refresh rates between the monitors. Internally, Present asserts that each CRTC in the system identifies a unique MSC domain, and it has a driver API which identifies which CRTC a particular window should be associated with. Once a particular CRTC has been identified for a window, client-relative MSC values and CRTC-relative MSC values can be exchanged using an offset between that CRTC MSC domain and the window MSC domain. The Intel driver assigns CRTCs to windows by picking the CRTC showing the greatest number of pixels for a particular window. When two CRTCs show the same number of pixels, the Intel driver picks the first in the list. Vblank Synchronization and Multiple Monitors Ok, so each window lives in a particular MSC domain, clocked by the MSC of the CRTC the driver has associated it with. In an un-composited world, this makes picking when to update the screen pretty simple; Present updates the screen when vblank happens in the CRTC associated with the window. In the composite redirected case, it s a bit harder all of the PresentRegion operations are going to target the root window, and yet we want updates for each window to be synchronized with the monitor containing that window. Of course, the root window belongs to a single MSC domain (likely the largest monitor, using the selection algorithm described above from the Intel driver). So, any PresentRegion requests will be timed relative to that single monitor. I think what is required here is for the PresentRegion extension to take an optional CRTC argument, which would then be used as the MSC domain instead of the window MSC domain. All of the timing arguments would be interpreted relative to that CRTC MSC domain. The PresentRedirectNotify event would then contain the relevant CRTC and the MSC value would be relative to that CRTC. A clever Compositing manager could then decompose a global PresentRegion operation into per-CRTC PresentRegion operations and ensure that multiple monitors were all synchronized correctly. We could take this even further and have the PresentRegion capable of passing a smaller CRTC-sized pixmap down to the kernel, effectively providing per-CRTC pixmaps with no visible explicit protocol Other Composite Users Ok, so the above discussion is clearly focused on getting the correct contents onto the screen with minimal copies along the way. However, what I ve ignored is how to deal with other applications, also using Composite at the same time. They re all going to expect that the composite redirect buffers will contain correct window contents at all times, and yet we ve just spent a bunch of time making that not be the case to avoid copying data into those buffers and instead copying directly to the compositing manager back or front buffers. Obviously the X server is aware of when this happens; the compositing manager will have selected for manual redirection on all top-level windows, while our other application will have only been able to select for automatic redirection. So, we ve got two pretty clear choices here:

Have the X server change how Present redirection works when some other application selects for Automatic redirection on a window. It would copy the source pixmap into the window buffer and then send (a modified?) PresentRedirectNotify event to the compositing manager.
Include a flag in the PresentRedirectNotify event that the composite redirect buffer needs to eventually get the contents of the source pixmap, and then expect the compositing manager to figure out what to do.

Development Plans As usual, I m going to pick the path of least resistance for all of the above options and see how things look; where the easy thing works, we can keep using it. Where the easy thing fails, I ll try something else. The changes required for this are pretty minimal. The PresentRegion request needs to gain a list of window/serial pairs that are also to be notified when the operation completes:

PRESENTNOTIFY  
    window: WINDOW
    serial: CARD32
     
 
    PresentRegion
    window: WINDOW
    pixmap: PIXMAP
    serial: CARD32
    valid-area: REGION or None
    update-area: REGION or None
    x-off, y-off: INT16
    idle-fence: FENCE
    target-crtc: CRTC or None
    target-msc: CARD64
    divisor: CARD64
    remainder: CARD64
    notifies: LISTofPRESENTNOTIFY
 
    Errors: Window, Pixmap, Match

The target-crtc parameter explicitly identifies a CRTC MSC domain. If None, then this request implicitly uses the window MSC domain. notifies provides a list of windows that will also receive PresentCompleteNotify events with the associated serial number when this PresentRegion operation completes.

 
    PresentRedirectNotify
    type: CARD8         XGE event type (35)
    extension: CARD8        Present extension request number
    length: CARD16          2
    evtype: CARD16          Present_RedirectNotify
    eventID: PRESENTEVENTID
    event-window: WINDOW
    window: WINDOW
    pixmap: PIXMAP
    serial: CARD32
    valid-area: REGION
    valid-rect: RECTANGLE
    update-area: REGION
    update-rect: RECTANGLE
    x-off, y-off: INT16
    target-crtc: CRTC
    target-msc: CARD64
    idle-fence: FENCE
    update-window: BOOL

The target-crtc identifies which CRTC MSC domain the target-msc value relates to. divisor and remainder have been removed as the target-msc value has been adjusted using the application values. If update-window is True, then the recipient of this event is instructed to provide reasonably up-to-date contents directly to the window by copying the contents of pixmap to the window manually. Beyond these two protocol changes, the compositing manager is expected to receive Sync events when the idle-fence is triggered and then manually perform a Sync operation to trigger the client s idle-fence when appropriate. I m planning to work on these changes, and then go re-work xcompmgr (or perhaps unagi, which certainly looks less messy) to incorporate support for Present redirection. The goal is to have something to demonstrate at Guadec, which doesn t seem impossible, aside from leaving on vacation in four days

23 July 2013

Keith Packard: present sync

Implementing Vblank Synchronization in the Present Extension This is mostly a status update on how the Present extension is doing; the big news this week is that I ve finished implementing vblank synchronized blts and flips, and things seem to be working quite well. Vblank Synchronized Blts The goal here is to have the hardware executing the blt operation in such a way as to avoid any tearing artifacts. In current drivers, there are essentially two different ways to make this happen:

Insert a command into the ring which blocks execution until a suitable time immediately preceding the blt operation.
Queue the blt operation at vblank time so that it executes before the scanout starts.

Option 1. provides the fewest artifacts; if the hardware can blt faster than scanout, there shouldn t ever be anything untoward visible on the screen. However, it also blocks future command execution within the same context. For example, if two vblank synchronized blts are queued at the same time, its possible for the second blt to be delayed by yet another frame, causing both applications to run at half of the frame rate. Option 2. avoids blocking the hardware, allowing for ongoing operations to proceed on the hardware without waiting for the synchronized blt to complete. However, it can cause artifacts if the delay from the vblank event to the eventual execution of the blt command is too long. Queuing the blt right when it needs to execute means that we also have the opportunity to skip some blts; if the application presents two buffers within the same frame time, the blt of the first buffer can be skipped, saving memory bandwidth and time. Present uses Option 2, which may occasionally cause a tearing artifact, but avoids slowing down applications while allowing the X server to discard overlapping blt operations when possible. Queuing the Blt at Vblank There are several options for getting the blt queued and executed when the vblank occurs:

Queue the blt from the interrupt handler
Queue the blt from a kernel thread running in response to the interrupt
Send an event up to user space and have the X server construct the blt command.

These are listed in increasing maximum latency, but also in decreasing complexity. Option 1. is made more complicated as much of the work necessary to get a command queued to the hardware cannot be done from interrupt context. One can imagine having the desired command already present in the ring buffer and have the interrupt handler simply move the ring tail pointer value. Future operations to be queued before the vblank operation could then re-write the ring as necessary. A queued operation could also be adjusted by the X server as necessary to keep it correct across changes to the window system state. Option 2. is similar, but the kernel implementation should be quite a bit simpler as the queuing operation is done in process context and can use the existing driver infrastructure. For the X server, this is the same as Option 1, requiring that it construct a queued blt operation and deliver that to the kernel, and then revoke and re-queue if the X server state changed before the operation was completed. Option 3 is the simplest of all, requiring no changes within the kernel and few within the X server. The X server waits to receive a vblank notification event for the appropriate frame and then simply invokes existing mechanisms to construct and queue the blt operation to the kernel. Oddly, Present currently uses Option 3. If that proves to generate too many display artifacts, we can come back and change the code to try something more complicated. Flipping the Frame Buffer Taking advantage of the hardware s ability to quickly shift scanout from one chunk of memory to another is critical to providing efficient buffer presentation within the X server. It is slightly more complicated to implement than simply copying data to the current scanout buffer for a few reasons:

The presented pixmap is owned by the application, and so it shouldn t be used except when the presented window covers the whole screen. When the window gets reconfigured, we end up copying the window s pixmap to the regular screen pixmap.
The kernel flipping API is asynchronous, and doesn t provide any abort mechanism. This isn t usually much of an issue; we simply delay reporting the actual time of flip until the kernel sends the notification event to the X server. However, if the window is reconfigured or destroyed while the flip is still pending, cleaning that up must wait until the flip has finished.
The applications buffer remains busy until it is no longer being used for scanout; that means that applications will have to be aware of this and ensure that they don t deadlock waiting for the current scanout buffer to become idle before switching to a new scanout buffer.

Present is different from DRI2 in using application-allocated buffers for this operation. For DRI2, when flipping to a window buffer, that buffer becomes the screen pixmap the driver flips the new buffer object into the screen pixmap and releases the previous buffer object for other use. For Present, as the buffer is owned by the application, I figured it would be better to switch back to the real screen buffer when necessary. This also means that applications aren t left holding a handle to the frame buffer, which seems like it might be a nice feature. The hardest part of this work was dealing with client and server shutdown, dealing with objects getting deleted in random orders while other data structures retained references. (The kernel DRM drivers use the term page flipping to mean an atomic flip between one frame buffer and another, generally implemented by simply switching the address used for the scanout buffer. I d like to avoid using the word page in this context as we re not flipping memory pages individually, but rather a huge group of memory that forms an entire frame buffer. We could use plane flipping (as intel docs do), frame buffer flipping (but that s a mouthful), display flipping or almost anything but page flipping ). Overall DRI3000 Status At this point, the DRI3 extension is complete and the Present extension is largely complete, except for redirection for compositors. The few piglit tests for GLX_OML_sync_control all pass now, which is at least better than DRI2 does. I think I ve effectively replicated the essential features of DRI2 while offering room to implement a couple of new GL extensions:

GLX_EXT_swap_control_tear. This will provide applications with the ability to avoid dropping frames when pushing the hardware just over the frame rate limit.
EGL_EXT_buffer_age. (I assume we ll probably want a GLX version as well?) This will allow compositors to more efficiently perform partial updates in a flipping environment, and is enabled by having all of the buffer management within the GL library.

The code for this stuff has all been pushed to a number of repositories:

git://people.freedesktop.org/~keithp/dri3proto.git master. DRI3 protocol specification and X server headers.
git://people.freedesktop.org/~keithp/presentproto.git master. Present protocol specification and X server headers.
git://people.freedesktop.org/~keithp/xcb/proto.git dri3. XCB protocol defines for both DRI3 and Present.
git://people.freedesktop.org/~keithp/xcb/libxcb.git dri3. XCB library changes for file descriptor passing.
git://people.freedesktop.org/~keithp/xserver.git dri3. X server with file descriptor passing, DRI3 and Present support.
git://people.freedesktop.org/~keithp/mesa.git dri3 Mesa with DRI3/Present support for GLX.
git://people.freedesktop.org/~keithp/drm.git dri3. DRM library with defines for async flipping.
git://people.freedesktop.org/~keithp/xf86-video-intel.git dri3. Intel driver with DRI3, Present and async flipping support.
git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6 dri3. Kernel with async flipping.

22 July 2013

Keith Packard: async flip

Asynchronous Display Flipping for Linux Using display (or plane or page) flipping is a nice optimization for screen updates as it avoids copying the application render buffer to the separate scanout buffer. However, in the current DRM interface, the flip operation is not performed immediately. Rather, the flip operation must be presented to the kernel before the vertical blank interval preceding the desired display time. If the flip operation is even slightly delayed, the new scanout image will not be visible until the following frame. With many applications pushing the performance of the graphics hardware right to the edge, it s not at all uncommon for an occasional frame to just miss and cause an ugly jerk on the screen. The GLX_EXT_swap_control_tear extension provides a way for applications to request that when the swap request is delivered too late, instead of delaying for the next frame, the contents should be presented as soon as possible, even if that might cause some tearing artifacts on the screen. This gives the application developer the choice between the two visual artifacts (tearing vs judder). A Linux interface for Async Flipping This part was pretty easy to implement. First, a new capability value, DRM_CAP_ASYNC_PAGE_FLIP was added so that applications could discover whether the driver offered this feature or not. Second, a new flag value, DRM_MODE_PAGE_FLIP_ASYNC was added to the DRM_IOCTL_MODE_PAGE_FLIP ioctl. To get the new flag down into the driver required adding the flag parameter to the page_flip crtc function. Async Display Flipping support in Intel Hardware Recent Intel display hardware has support for asynchronous flipping, either through the ring using the MI_DISPLAY_FLIP instruction or directly through the registers. There are some limitations however:

The pitch of the current and new frame buffers must match
Both current and new frame buffers must use X tiling

I ve also only implemented flipping on Sandybridge and Ivybridge hardware as that s what I ve got for testing. Using Async Display Flipping with Present I m using async display flipping when the kernel supports it in the current Present implementation. This offers the ability to short-circuit multiple presentations targeting the same scanout frame. I haven t exposed the ability to select this to applications over the Present protocol yet, but that s not a big change. Implementing the GLX_EXT_swap_control_tear extension I haven t even looked at implementing this extension; I m hoping that once I ve gotten the functionality exposed over the X protocol that someone will at least help me figure out how to hook this up. Getting the Bits The interesting pieces here are the new kernel bits; the DRM library changes simply add new defines to existing APIs.

Kernel bits are in my kernel.org repository git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6 dri3
DRM library bits are in my freedesktop.org drm repository git://people.freedesktop.org/home/keithp/drm.git dri3

12 July 2013

Keith Packard: Present

The Present Extension I ve finally gotten an initial implementation of the Present extension written and running and thought I should write up the current design and status. Present Design The current Present extension consists of two requests:

PresentRegion. Puts new bits from a pixmap in a window.
PresentSelectInput. Asks for Present events to be delivered.

PresentRegion This request takes a pile of arguments:

 
    PresentRegion
    window: WINDOW
    pixmap: PIXMAP
    valid-area: REGION or None
    update-area: REGION or None
    x-off, y-off: INT16
    target-msc: CARD64
    divisor: CARD64
    remainder: CARD64
    idle-fence: FENCE
 
Errors: Drawable, Pixmap, Match

Provides new content for the specified window, to be made visible at the specified time (defined by target-msc , divisor and remainder ). update-area defines the subset of the window to be updated, or None if the whole window is to be updated. valid-area defines the portion of pixmap which contains valid window contents, or None if the pixmap contains valid contents for the whole window. PresentRegion may use any region of pixmap which contains update-area and which is contained by valid-area . In other words, areas inside update-area will be presented from pixmap , areas outside valid-area will not be presented from pixmap and areas inside valid-area but outside update-area may or may not be presented at the discretion of the X server. x-off and y-off define the location in the window where the 0,0 location of the pixmap will be presented. valid-area and update-area are relative to the pixmap. If target-msc is greater than the current msc for window , the presentation will occur at (or after) the target-msc field. Otherwise, the presentation will occur after the next field where msc % divisor == remainder . If divisor is zero, then the presentation will occur after the current field. idle-fence is triggered when pixmap is no longer in use. This may be at any time following the PresentRegion request, the contents may be immediately copied to another buffer, copied just in time for the vblank interrupt or the pixmap may be used directly for display, in which case it will be busy until some future PresentRegion operation. If window is destroyed before the presentation occurs, then the presentation action will not be completed. PresentRegion holds a reference to pixmap until the presentation occurs, so pixmap may be immediately freed after the request executes, even if that is before the presentation occurs. If idle-fence is destroyed before the presentation occurs, then idle-fence will not be signaled but the presentation will occur normally. PresentSelectInput

 
    PresentSelectInput
    event-id: PRESENTEVENTID
    window: WINDOW
    eventMask: SETofPRESENTEVENT
 
Errors: Window, Value, Match, IDchoice, Access

Selects the set of Present events to be delivered for the specified window and event context. PresentSelectInput can create, modify or delete event contexts. An event context is associated with a specific window; using an existing event context with a different window generates a Match error. If eventContext specifies an existing event context, then if eventMask is empty, PresentSelectInput deletes the specified context, otherwise the specified event context is changed to select a different set of events. If eventContext is an unused XID, then if eventMask is empty no operation is performed. Otherwise, a new event context is created selecting the specified events. Present Extension Events There are three different events for the Present extension:

PresentConfigureNotify
PresentCompleteNotify
PresentRedirectNotify

PresentConfigureNotify This event is moving from the DRI3 extension where it doesn t belong.

 
    PresentConfigureNotify
    type: CARD8         XGE event type (35)
    extension: CARD8        Present extension request number
    length: CARD16          2
    evtype: CARD16          Present_ConfigureNotify
    eventID: PRESENTEVENTID
    window: WINDOW
    x: INT16
    y: INT16
    width: CARD16
    height: CARD16
    off_x: INT16
    off_y: INT16
    pixmap_width: CARD16
    pixmap_height: CARD16
    pixmap_flags: CARD32

PresentConfigureNotify events are sent when the window configuration changes if PresentSelectInput has requested it. PresentConfigureNotify events are XGE events and so do not have a unique event type. x and y are the parent-relative location of window . PresentCompleteNotify

 
    PresentCompleteNotify
    type: CARD8         XGE event type (35)
    extension: CARD8        Present extension request number
    length: CARD16          2
    evtype: CARD16          Present_CompleteNotify
    eventID: PRESENTEVENTID
    window: WINDOW
    ust: CARD64
    msc: CARD64
    sbc: CARD64

Notify events are delivered when a PresentRegion operation has completed and the specified contents are being displayed. sbc, msc and ust indicate the swap count, frame count and system time of the related PresentRegion request. PresentRedirectNotify This one is not specified yet, but the intent is for it to contain sufficient information for the compositing manager to be able to construct a suitable screen update that includes an application window update. Finishing GLXOMLsync_control At this point, Mesa only exposes the old swapinterval configuration value, it doesn t provide any of the GLXOMLsynccontrol APIs. However, the Present protocol does have the bits necessary to support glXSwapBuffersMscOML in the PresentRegion request. Let s see how the remaining APIs in this GL extension will be supported. glXGetSyncValuesOML This one is easy; it only needs to return the UST/MSC/SBC values from the most recent SwapBuffers request. Those are returned in the PresentCompleteNotify event, so Mesa just needs to capture that event and save the values away for return to the application. We don t need any new Present protocol for this. glXGetMscRateOML This returns the refresh rate of the monitor associated with the specified drawable. RandR exposes all of the necessary data for each monitor, but the monitor each window is going to be synchronized against isn t exposed anywhere, and is effectively implementation-dependent. So, I think the easy thing to do here is to have a Present request which reports which RandR output a window will be synchronized with. glXSwapBuffersMscOML This is the one API which is already directly supported by the Present extension. glXWaitForMscOML This is effectively the same as glXSwapBuffersMscOML, except that it doesn t actually perform a swap. For this, I think we want a new request that generates an X event when the target values are reached, and then have the client block until that event is received. This will avoid blocking other threads using the same X connection. glXWaitForSbcOML Like glXWaitForMscOML, this just needs to trigger an event to be delivered at the right time. Presentation Redirection I m focusing on finishing the above stuff before I start writing the redirection spec and code, but I ve been thinking a bit about it. What we want is for applications that are presenting a new frame within a redirected window to have that presentation delivered directly to the compositing manager instead of having the bits copied to the redirected buffer and have the compositing manager only learn about this when the damage event is received from copying the bits. Once the compositing manager receives notification that there are new bits available for a portion of a window, it can then get those bits onto the screen in one of two ways:

Copy them to the back buffer for the whole screen and then present that back buffer to the screen.
Present them directly to the screen, bypassing the compositing manager s back buffer entirely.

The first option is what you d do if there were more updates than just a single window; it would update the whole screen at the same time. The second option is what you d do if there was only the one window update to act on, and this would automatically take advantage of page flipping to avoid copying data at all. No need to un-redirect windows for the page flip to work. In both cases, we need to inform the original application both when its pixmap is idle and when the swap actually hits the screen. And we need to make sure the final swap happens when the application requested. For the pixmap idle notification, the current PresentRegion idlefence argument should suffice; we just need to pass the idlefence XID along in the redirection event. Simple. For completion notification, we need to make sure the SBC value for the original window gets incremented and that a PresentCompleteNotify event is delivered. I think that means we want to append a chunk of data to the PresentRegion request so that suitable events can be delivered when the compositing manager PresentRegion occur. I think that just needs to be the window of that original window. To make sure the swap happens at the right time, we just need to have the targetmsc value provided to the compositing manager. I m hoping that having this all event driven is that when a later PresentRegion is redirected that has an *earlier* targetmsc, we can simply queue that as well and things should just work . Current Status I ve got the protocol definitions (both X server and XCB) done, and libxcb supporting the extension. I ve got the X server bits working, but the actual updates are not synchronized to the monitor; they re using OS timers and CopyArea for now, mostly so I can test things without also hacking drivers. Mesa is using the extension, but it only provides the swapinterval value and is not yet supporting the full GLXOMLsynccontrol extension. I ve also got a simple 2D core X application using the extension which is in the shmfd repository in my home directory on freedesktop.org. Availability As always, these bits are already published in my home directory on freedesktop.org in various git repositories.

4 June 2013

Keith Packard: dri3 extension

Completing the DRI3 Extension This week marks a pretty significant milestone for the glorious DRI3000 future. The first of the two new extensions is complete and running both full Gnome and KDE desktops. DRI3 Extension Overview The DRI3 extension provides facilities for building direct rendering libraries to work with the X window system. DRI3 provides three basic mechanisms:

Open a DRM device.
Share kernel objects associated with X pixmaps. The direct rendering client may allocate kernel objects itself and ask the X server to construct a pixmap referencing them, or the client may take an existing X pixmap and discover the underlying kernel object for it.
Synchronize access to the kernel objects. Within the X server, Sync Fences are used to serialize access to objects. These Sync Fences are exposed via file descriptors which the underlying driver can use to implement synchronization. The current Intel DRM driver passes a shared page containing a Linux Futex.

Opening the DRM Device Ideally, the DRM application would be able to just open the graphics device and start drawing, sending the resulting buffers to the X server for display. There s work going on to make this possible, but the current situation has the X server in charge of blessing the file descriptors used by DRM clients. DRI2 does this by having the DRM client fetch a magic cookie from the kernel and pass that to the X server. The cookie is then passed to the kernel which matches it up with the DRM client and turns on rendering access for that application. For DRI3, things are much simpler the DRM client asks the X server to pass back a file descriptor for the device. The X server opens the device, does the magic cookie dance all by itself (at least for now), and then passes the file descriptor back to the application.

 
    DRI3Open
    drawable: DRAWABLE
    driverType: DRI3DRIVER
    provider: PROVIDER
       
    nfd: CARD8
    driver: STRING
    device: FD
 
    Errors: Drawable, Value, Match
    This requests that the X server open the direct rendering
    device associated with drawable, driverType and RandR
    provider. The provider must support SourceOutput or SourceOffload.
    The direct rendering library used to implement the specified
    'driverType' is returned in 'driver'. The file
    descriptor for the device is returned in 'device'. 'nfd' will
    be set to one (this is strictly a convenience for XCB which
    otherwise would need request-specific information about how
    many file descriptors were associated with this reply).

Sharing Kernel Pixel Buffers An explicit non-goal of DRI3 is support for sharing buffers that don t map directly to regular X pixmaps. So, GL ancillary buffers like depth and stencil just don t apply here. The shared buffers in DRI3 are regular X pixmaps in the X server. This provides a few obvious benefits over the DRI2 scheme: In the kernel, the buffers are referenced by DMA-BUF handles, which provides a nice driver-independent mechanism.

Lifetimes are easily managed. Without being associated with a separate drawable, it s easy to know when to free the Pixmap.
Regular X requests apply directly. For instance, copying between buffers can use the core CopyArea request.

To create back- and fake-front- buffers for Windows, the application creates a kernel buffer, associates a DMA-BUF file descriptor with that and then sends the fd to the X server with a pixmap ID to create the associated pixmap. Doing it in this direction avoids a round trip.

 
    DRI3PixmapFromBuffer
    pixmap: PIXMAP
    drawable: DRAWABLE
    size: CARD32
    width, height, stride: CARD16
    depth, bpp: CARD8
    buffer: FD
 
    Errors: Alloc, Drawable, IDChoice, Value, Match
    Creates a pixmap for the direct rendering object associated
    with 'buffer'. Changes to pixmap will be visible in that
    direct rendered object and changes to the direct rendered
    object will be visible in the pixmap.
    'size' specifies the total size of the buffer bytes. 'width',
    'height' describe the geometry (in pixels) of the underlying
    buffer. 'stride' specifies the number of bytes per scanline in
    the buffer. The pixels within the buffer may not be arranged
    in a simple linear fashion, but 'size' will be at least
    'height' * 'stride'.
    Precisely how any additional information about the buffer is
    shared is outside the scope of this extension.
    If buffer cannot be used with the screen associated with
    drawable, a Match error is returned.
    If depth or bpp are not supported by the screen, a Value error
    is returned.

To provide for texture-from-pixmap, the application takes the pixmap ID and passes that to the X server which returns the a file descriptor for a DMA-BUF which is associated with the underlying kernel buffer.

 
    DRI3BufferFromPixmap
    pixmap: PIXMAP
       
    depth: CARD8
    size: CARD32
    width, height, stride: CARD16
    depth, bpp: CARD8
    buffer: FD
 
    Errors: Pixmap, Match
    Pass back a direct rendering object associated with
    pixmap. Changes to pixmap will be visible in that
    direct rendered object and changes to the direct rendered
    object will be visible in the pixmap.
    'size' specifies the total size of the buffer bytes. 'width',
    'height' describe the geometry (in pixels) of the underlying
    buffer. 'stride' specifies the number of bytes per scanline in
    the buffer. The pixels within the buffer may not be arranged
    in a simple linear fashion, but 'size' will be at least
    'height' * 'stride'.
    Precisely how any additional information about the buffer is
    shared is outside the scope of this extension.
    If buffer cannot be used with the screen associated with
    drawable, a Match error is returned.

Tracking Window Size Changes When Eric Anholt and I first started discussing DRI3, we hoped to avoid needing to learn about the window size from the X server. The thought was that the union of all of the viewports specified by the application would form the bounds of the drawing area. When the window size changed, we expected the application would change the viewport. Alas, this simple plan isn t sufficient here a few GL functions are not limited to the viewport. So, we need to track the actual window size and monitor changes to it. DRI2 does this by delivering invalidate events to the application whenever the current buffer isn t valid; the application discovers that this event has been delivered and goes to as the X server for the new buffers. There are a couple of problems with this approach:

Any outstanding DRM rendering requests will still draw to the old buffers.
The Invalidate events must be captured before the application sees the related ConfigureNotify event so that the GL library can react appropriately.

The first problem is pretty intractable within DRI2 the application has no way of knowing whether a frame that it has drawn was delivered to the correct buffer as the underlying buffer object can change at any time. DRI3 fixes this by having the application in control of buffer management; it can easily copy data from the previous back buffer to the new back buffer synchronized to its own direct rendering. The second problem was solved in DRI2 by using the existing Xlib event hooks; the GL library directly implements the Xlib side of the DRI2 extension and captures the InvalidateBuffers events within that code, delivering those to the driver code. The problem with this solution is that Xlib holds the Display structure mutex across this whole mess, and Mesa must be very careful not to make any Xlib calls during the invalidate call. For DRI3, I considered placing the geometry data in a shared memory buffer, but my future plans for the Present extension led me to want an X event instead (more about the Present extension in a future posting). An X ConfigureNotify event is sufficient for the current requirements to track window sizes accurately. However, there s no easy way for the GL library to ensure that ConfigureNotify events will be delivered to the application other application code may (and probably will) adjust the window event mask for its own uses. I considered adding the necessary event mask tracking code within XCB, but again, knowing that the Present extension would probably need additional information anyhow, decided to create a new event instead. Using an event requires that XCB provide some mechanism to capture those events, keep them from the regular X event stream, and deliver them to the GL library. A further requirement is that the GL library be absolutely assured of receiving notification about these events before the regular event processing within the application will see a core ConfigureNotify event. The method I came up with for XCB is fairly specific to my requirements. The events are always XGE events, and are tagged with a special event context ID , an XID allocated for this purpose. The combination of the extension op-code, the event type and this event context ID are used to split off these events to custom event queues using the following APIs:

/**
 * @brief Listen for a special event
 */
xcb_special_event_t *xcb_register_for_special_event(xcb_connection_t *c,
                                                    uint8_t extension,
                                                    uint16_t evtype,
                                                    uint32_t eid,
                                                    uint32_t *stamp);

This creates a special event queue which will contain only events matching the specified extension/type/event-id triplet.

/**
 * @brief Returns the next event from a special queue
 */
xcb_generic_event_t *xcb_check_for_special_event(xcb_connection_t *c,
                                                 xcb_special_event_t *se);

This pulls an event from a special event queue. These events will not appear in the regular X event queue and so applications will never see them. There s one more piece of magic here the stamp value passed to xcbregisterforspecialevent. This pointer refers to a location in memory which will be incremented every time an event is placed in the special event queue. The application can cheaply monitor this memory location for changes and known when to check the queue for events. Within GL, the value used is the existing dri2 stamp value. That is checked at the top of the rendering operation; if it has changed, the drawing buffers will be re-acquired. Part of the buffer acquisition process is a check for special events related to the window. For now, I ve placed these events in the DRI3 extension. However, they will move to the Present extension once that is working.

 
    DRI3SelectInput
    eventContext: DRI3EVENTID
    window: WINDOW
    eventMask: SETofDRI3EVENT
 
    Errors: Window, Value, Match, IDchoice
    Selects the set of DRI3 events to be delivered for the
    specified window and event context. DRI3SelectInput can
    create, modify or delete event contexts. An event context is
    associated with a specific window; using an existing event
    context with a different window generates a Match error.
    If eventContext specifies an existing event context, then if
    eventMask is empty, DRI3SelectInput deletes the specified
    context, otherwise the specified event context is changed to
    select a different set of events.
    If eventContext is an unused XID, then if eventMask is empty
    no operation is performed. Otherwise, a new event context is
    created selecting the specified events.

The events themselves look a lot like a configure notify event:

 
    DRI3ConfigureNotify
    type: CARD8         XGE event type (35)
    extension: CARD8        DRI3 extension request number
    length: CARD16          2
    evtype: CARD16          DRI3_ConfigureNotify
    eventID: DRI3EVENTID
    window: WINDOW
    x: INT16
    y: INT16
    width: CARD16
    height: CARD16
    off_x: INT16
    off_y: INT16
    pixmap_width: CARD16
    pixmap_height: CARD16
    pixmap_flags: CARD32
 
    'x' and 'y' are the parent-relative location of 'window'.

Note that there are a couple of odd additional fields offx, offy, pixmapwidth, pixmapheight and pixmap_flags are all place-holders for what I expect to end up in the Present extension. For now, in DRI3, they should be ignored. Synchronization The DRM application needs to know when various X requests related to its buffers have finished. In particular, when performing a buffer swap, the client wants to know when that completes, and be able to block until it has. DRI2 does this by having the application make a synchronous request from the X server to get the names of the new back buffer for drawing the next frame. This has two problems:

The synchronous round trip to the X server isn t free. Other running applications may cause fairly arbitrary delays in getting the reply back from the X server.
Synchronizing with the X server doesn t ensure that GPU operations are necessarily serialized between the application and the X server.

What we want is a serialization guarantee between the X server and the DRM application that operates at the GPU level. I ve written a couple of times (dri3k first steps and Shared Memory Fences) about using X Sync extension Fences (created by James Jones and Aaron Plattner) for this synchronization and wanted to get a bit more specific here. With the X server, a Sync extension Fence is essentially driver-specific, allowing the hardware design to control how the actual synchronization is performed. DRI3 creates a way to share the underlying operating system object by passing a file descriptor from application to the X server which somehow references that device object. Both sides of the protocol need to tacitly agree on what it means.

 
    DRI3FenceFromFD
    drawable: DRAWABLE
    fence: FENCE
    initially-triggered: BOOL
    fd: FD
 
    Errors: IDchoice, Drawable
    Creates a Sync extension Fence that provides the regular Sync
    extension semantics along with a file descriptor that provides
    a device-specific mechanism to manipulate the fence directly.
    Details about the mechanism used with this file descriptor are
    outside the scope of the DRI3 extension.

For the current GEM kernel interface, because all GPU access is serialized at the kernel API, it s sufficient to serialize access to the kernel itself to ensure operations are serialized on the GPU. So, for GEM, I m using a shared memory futex for the DRI3 synchronization primitive. That does not mean that all GPUs will share this same mechanism. Eliminate the kernel serialization guarantee and some more GPU-centric design will be required. What about Swap Buffers? None of the above stuff actually gets bits onto the screen. For now, the GL implementation is simply taking the X pixmap and copying it to the window at SwapBuffers time. This is sufficient to run applications, but doesn t provide for all of the fancy swap options, like limiting to frame rate or optimizing full-screen swaps. I ve decided to relegate all of that functionality to the as-yet-unspecified Present extension. Because the whole goal of DRI3 was to get direct rendered application contents into X pixmaps, the Present extension will operate on those X objects directly. This means it will also be usable with non-DRM applications that use simple X pixmap based double buffering, a class which includes most existing non-GL based Gtk+ and Qt applications. So, I get to reduce the size of the DRI3 extension while providing additional functionality for non direct-rendered applications. Current Status As I said above, all of the above functionality is running on my systems and has booted both complete KDE and Gnome sessions. There have been some recent DMA-BUF related fixes in the kernel, so you ll need to run the latest 3.9.x stable release or a 3.10 release candidate. Here s references to all of the appropriate git repositories: DRI3 protocol and spec:

git://people.freedesktop.org/~keithp/dri3proto      master

XCB protocol

git://people.freedesktop.org/~keithp/xcb/proto.git  dri3

XCB library

git://people.freedesktop.org/~keithp/xcb/libxcb.git dri3

xshmfence library:

git://people.freedesktop.org/~keithp/libxshmfence.git   master

X server:

git://people.freedesktop.org/~keithp/xserver.git    dri3

Mesa:

git://people.freedesktop.org/~keithp/mesa.git       dri3

Next Steps Now it s time to go write the Present extension and get that working. I ll start coding and should have another posting here next week.

22 May 2013

Keith Packard: Altos1.2.1

AltOS 1.2.1 TeleBT support, bug fixes and new AltosUI features Bdale and I are pleased to announce the release of AltOS version 1.2.1. AltOS is the core of the software for all of the Altus Metrum products. It consists of cc1111-based micro-controller firmware and Java-based ground station software. The biggest new feature for AltOS is the addition of support for TeleBT, our ground station designed to operate with Android phones and tablets. In addition, there s a change in the TeleDongle radio configuration that should improve range, some other minor bug fixes and new features in AltosUI AltOS Firmware Features and fixes There are bug fixes in both ground station and flight software, so you should plan on re-flashing both units at some point. However, there aren t any incompatible changes, so you don t have to do it all at once. New features:

TeleBT support.
Improved radio sensitivity. The TeleDongle receiver parameters have been tweaked to provide better reception.
TeleMini now completely resets all radio parameters in recovery mode (with the two outer debug pins connected) 434.550MHz, N0CALL, factory radio cal.

Bug fixes:

USB device fixes. This improves operation with Windows, avoiding hangs and errors in many cases.
Correct the Kalman filter error covariance matrix; the old parameters were built assuming continuous measurements.

AltosUI Easier to use AltosUI has also seen quite a bit of work for the 1.2.1 release. It s got several fun new features and a few bug fixes. New Graph UI features:

Show tool-tips with the value near the cursor.
Make the set of displayed values configurable. Add all of the available data values just in case you want to see them.
Added a Map tab showing the ground track of the whole flight.
The flight summary tab now includes the final GPS position. This lets you figure out where your rocket landed without replaying the whole flight.

Other new AltosUI features:

TeleBT support, including Bluetooth connections (Linux-only, at present).
Shows the callsign in the Monitor Idle and other command-mode windows so that you can tell what callsign is being used.
Show the block number when downloading flight data. This lets you see something happen even for longer flights.
Make the initial position of the AltosUI configurable so that you can position it out of the way of the rest of you desktop.
Distribute Mac OS X in .dmg format (Mac OS Disk Image); this means you don t need to explicitly unpack the bits.

Bug fixes:

Deal with broken networking while downloading map tiles. Tiles are now always downloaded asynchronously so that the UI doesn t freeze when the network is slow.

26 April 2013

Keith Packard: Shared Memory Fences

Shared Memory Fences In our last adventure, dri3k first steps, one of the future work items was to deal with synchronization between the direct rendering application and the X server. DRI2 handles this by performing a round trip each time the application starts using a buffer that was being used by the X server. As DRI3 manages buffer allocation within the application, there s really no reason to talk to the server, so this implicit serialization point just isn t available to us. As I mentioned last time, James Jones and Aaron Plattner added an explicit GPU serialization system to the Sync extension. These SyncFences serializing rendering between two X clients, but within the server there are hooks provided for the driver to use hardware-specific serialization primitives. The existing Linux DRM interfaces queue rendering to the GPU in the order requests are made to the kernel, so we don t need the ability to serialize within the GPU, we just need to serialize requests to the kernel. Simple CPU-based serialization gating access to the GPU will suffice here, at least for the current set of drivers. GPU access which is not mediated by the kernel will presumably require serialization that involves the GPU itself. We ll leave that for a future adventure though; the goal today is to build something that works with the current Linux DRM interfaces. SyncFence Semantics The semantics required by SyncFences is for multiple clients to block on a fence which a single client then triggers. All of the blocked clients start executing requests immediately after the trigger fires. There are four basic operations on SyncFences:

Trigger. Mark the fence as ready and wake up all waiting clients
Await. Block until the fence is ready.
Query. Retrieve the current state of the fence.
Reset. Unset the fence; future Await requests will block.

SyncFences are the same as Events as provided by Python and other systems. Of course all of the names have been changed to keep things interesting. I ll call them Fences here, to be consistent with the current X usage. Using Pthread Primitives One fact about pthreads that I recently learned is that the synchronization primitives (mutexes, barriers and semaphores) are actually supposed to work across process boundaries, if those objects are in shared memory mapped by each process. That seemed like a great simplification for this project; allocate a page of shared memory, map into the X server and direct rendering application and use the existing pthreads APIs. Alas, the pthread objects are architecture specific. I m pretty sure that when that spec was written, no-one ever thought of running multiple architectures within the same memory space. I went and looked at the code to check, and found that each of these objects has a different size and structure on x86 and x86_64 architectures. That makes it pretty hard to use this API within X as we often have both 32- and 64- bit applications talking to the same (presumably 64-bit) X server. As a last resort, I read through a bunch of articles on using futexes directly within applications and decided that it was probably possible to implement what I needed in an architecture-independent fashion. Futexes Linux Futexes live in this strange limbo of being a not-quite-public kernel interface. Glibc uses them internally to implement locking primitives, but it doesn t export any direct interface to the system call. Certainly they re easy to use incorrectly, but it s unusual in the Linux space to have our fundamental tools locked away for our own safety . Fortunately, we can still get at futexes by creating our own syscall wrappers.

static inline long sys_futex(void *addr1, int op, int val1,
                 struct timespec *timeout, void *addr2, int val3)
 
    return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);

For this little exercise, I created two simple wrappers, one to block on a futex:

static inline int futex_wait(int32_t *addr, int32_t value)  
    return sys_futex(addr, FUTEX_WAIT, value, NULL, NULL, 0);

and one to wake up all futex waiters:

static inline int futex_wake(int32_t *addr)  
    return sys_futex(addr, FUTEX_WAKE, MAXINT, NULL, NULL, 0);

Atomic Memory Operations I need atomic memory operations to keep separate cores from seeing different values of the fence value, GCC defines a few such primitives and I picked _syncboolcompareandswap and _syncvalcompareandswap. I also need fetch and store operations that the compiler won t shuffle around:

#define barrier() __asm__ __volatile__("": : :"memory")
static inline void atomic_store(int32_t *f, int32_t v)
 
    barrier();
    *f = v;
    barrier();
 
static inline int32_t atomic_fetch(int32_t *a)
 
    int32_t v;
    barrier();
    v = *a;
    barrier();
    return v;

If your machine doesn t make these two operations atomic, then you would redefine these as needed. Futex-based Fences These wake-all semantics of Fences greatly simplify reasoning about the operation as there s no need to ensure that only a single thread runs past Await, the only requirement is that no threads pass the Await operation until the fence is triggered. A Fence is defined by a single 32-bit integer which can take one of three values:

0 - The fence is not triggered, and there are no waiters.
1 - The fence is triggered (there can be no waiters at this point).
-1 - The fence is not triggered, and there are waiters (one or more).

With those, I built the fence operations as follows. Here s Await:

int fence_await(int32_t *f)
 
    while (__sync_val_compare_and_swap(f, 0, -1) != 1)  
        if (futex_wait(f, -1))  
            if (errno != EWOULDBLOCK)
                return -1;
         
     
    return 0;

The basic requirement that the thread not run until the fence is triggered is met by fetching the current value of the fence and comparing it with 1. Until it is signaled, that comparison will return false. The compareandswap operation makes sure the fence is -1 before the thread calls futex_wait, either it was already -1 in the case where there were other waiters, or it was 0 before and is now -1 in the case where there were no waiters before. This needs to be an atomic operation so that the fence value will be seen as -1 by the trigger operation if there are any threads in the syscall. The futex_wait call will return once the value is no longer -1, it also ensures that the thread won t block if the trigger occurs between the swap and the syscall. Here s the Trigger function:

int fence_trigger(int32_t *f)
 
    if (__sync_val_compare_and_swap(f, 0, 1) == -1)  
        atomic_store(f, 1);
        if (futex_wake(f) < 0)
            return -1;
     
    return 0;

The atomic compareandswap operation will make sure that no Await thread swaps the 0 for a -1 while the trigger is changing the value from 0 to 1; either the Await switches from 0 to -1 or the Trigger switches from 0 to 1. If the value before the compareandswap was -1, then there may be threads waiting on the Fence. An atomic store, constructed with two memory barriers and a regular store operation, to mark the Fence triggered is followed by the futex_wake call to unblock all Awaiting threads. The Query function is just an atomic fetch:

int fence_query(int32_t *f)
 
    return atomic_fetch(f) == 1;

Reset requires a compareandswap so that it doesn t disturb things if the fence has already been reset and there are threads waiting on it:

void fence_reset(int32_t *f)
 
    __sync_bool_compare_and_swap(f, 1, 0);

A Request for Review Ok, so we ve all tried to create synchronization primitives only to find that our obvious implementations were full of holes. I d love to hear from you if you ve identified any problems in the above code, or if you can figure out how to use the existing glibc primitives for this operation.

12 April 2013

Keith Packard: dri3k first steps

DRI3K First Steps Here s an update on DRI3000. I ll start by describing what I ve managed to get working and then summarize discussions that happened on the xorg-devel mailing list. Private Back Buffers One of the big goals for DRI3000 is to finish the job of moving buffer management out of the X server and into applications. The only thing still allocated by DRI2 in the X server are back buffers; everything else moved to the client side. Yes, I know, this breaks the GLX requirement for sharing buffers between applications, but we just don t care anymore. As a quick hack, I figured out how to do this with DRI2 today allocate our back buffers separately by creating X pixmaps for them, and then using the existing DRI2GetBuffersWithFormat request to get a GEM handle for them. Of course, now that all I ve got is a pixmap, I can t use the existing DRI2 swap buffer support, so for now I m just using CopyArea to get stuff on the screen. But, that works fine, as long as you don t care about synchronization. Handling Window Resize The biggest pain in DRI2 has been dealing with window resize. When the window resizes in the X server, a new back buffer is allocated and the old one discarded. An event is delivered to invalidate the old back buffer, but anything done between the time the back buffer is discarded and when the application responds to the event is lost. You can easily see this with any GL application today resize the window and you ll see occasional black frames. By allocating the back buffer in the application, the application handles the resize within GL; at some point in the rendering process the resize is discovered, and GL creates a new buffer, copies the existing data over, and continues rendering. So, the rendered data are never lost, and every frame gets displayed on the screen (although, perhaps at the wrong size). The puzzle here was how to tell that the window was resized. Ideally, we d have the application tell us when it received the X configure notify event and was drawing the frame at the new size. We thought of a cute hack that might do this; track GL calls to change the viewport and make sure the back buffer could hold the viewport contents. In theory, the application would receive the X configure notify event, change the viewport and render at the new size. Tracking the viewport settings for an entire frame and constructing their bounding box should describe the size of the window; at least it should describe the intended size of the window. There s at least one serious problem with this plan applications may well call glClear before calling glViewport, and as glClear does not use the current viewport, instead clearing the whole window, we couldn t use the viewport as an indication of the current window size. However, what this exercise did lead us to realize was that we don t care what size the window actually is, we only care what size the application thinks it is. More accurately, the GL library just needs to be aware of any window configuration changes before the application, so that it will construct a buffer that is not older than the application knowledge of the window size. I came up with two possible mechanisms here; the first was to construct a shared memory block between application and X server where the X server would store window configuration changes and signal the application by incrementing a sequence number in the shared page; the GL library would simply look at the sequence number and reallocate buffers when it changed. The problem with the shared memory plan was that it wouldn t work across the network, and we have a future project in mind to replace GLX indirect rendering with local direct rendering and PutImage which still needs accurate window size tracking. More about that project in a future post though X Events to the Rescue So, I decided to just have the X server send me events when the window size changed. I could simply use the existing X configure notify events, but that would require a huge infrastructure change in the application so that my GL library could get those events and have the application also see them. Not knowing what the application is up to, we d have to track every ChangeWindowAttributes call and make sure the event_mask included the right bits. Ick. Fortunately, there s another reason to use a new event we need more information than is provided in the ConfigureNotify event; as you know, the Swap extension wants to have applications draw their content within a larger buffer that can have the window decorations placed around it to avoid a copy from back buffer to window buffer. So, our new ConfigureNotify event would also contain that information. Making sure that ConfigureNotify event is delivered before the core ConfigureNotify event ensures that the GL library should always be able to know about window size changes before the application. Splitting the XCB Event Stream Ok, so I ve got these new events coming from the X server. I don t want the application to have to receive them and hand them down to the GL library; that would mean changing every application on the planet, something which doesn t seem very likely at all. Xlib does this kind of thing by allowing applications to stick themselves into the middle of the event processing code with a callback to filter out the events they re interested in before they hit the main event queue. That s how DRI2 captures Invalidate events, and it works , but using callbacks from the middle of the X event processing code creates all kinds of locking nightmares. As discussed above, I don t care when GL sees the configure events, as long as it gets them before the application finds about about the window size change. So, we don t need to synchronously handle these events, we just need to be able to know they ve arrived and then handle them on the next call to a GL drawing function. What I ve created as a prototype is the ability to identify specific events and place them in a separate event queue, and when events are placed in that event queue, to bump a sequence number so that the application can quickly identify that there s something to process. Making the Event Mask Per-API Instead of Per-Client The problem described above about using the core ConfigureNotify events made me think about how to manage multiple APIs all wanting to track window configuration. For core events, the selection of which events to receive is all based on the client; each client has a single event mask, and each client receives one copy of each event. Monolithic applications work fine with this model; there s one place in the application selecting for events and one place processing them. However, modern applications end up using different APIs for 3D, 2D and media. Getting those libraries to cooperate and use a common API for event management seems pretty intractable. Making the X server treat each API as a separate entity seemed a whole lot easier; if two APIs want events, just have them register separately and deliver two events flagged for the separate APIs. So, the new DRI3 configure notify events are created with their own XID to identify the client-side owner of the event. Within the X server, this required a tiny change; we already needed to allocate an XID for each event selection so that it could be automatically cleaned up when the client exited, so the only change was to use the one provided by the client instead of allocating one in the server. On the wire, the event includes this new XID so that the library can use it to sort out which event queue to stick the event in using the new XCB event stream splitting code. Current Status The above section describes the work that I ve got running; with it, I can run GL applications and have them correctly track window size changes without losing a frame. It s all available on the dri3 branches of my various repositories for xcb proto, libxcb, dri3proto and the X server. Future Directions The first obvious change needed is to move the configuration events from the DRI3 extension to the as-yet-unspecified new Swap extension (which I may rename as Present , as in please present this pixmap in this window ). That s because they aren t related to direct rendering, but rather to tracking window sizes for off-screen rendering, either direct, indirect or even with the CPU to memory. DRI3 and Fences Right now, I m not synchronizing the direct rendering with the CopyArea call; that means the X server will end up with essentially random contents as the application may be mid-way through the next frame before it processes the CopyArea. A simple XSync call would suffice to fix that, but I want a more efficient way of doing this. With the current Linux DRI kernel APIs, it is sufficient to serialize calls that post rendering requests to the kernel to ensure that the rendering requests are themselves serialized. So, all I need to do is have the application wait until the X server has sent the CopyArea request down to the kernel. I could do that by having the X server send me an X event, but I think there s a better way that will extend to systems that don t offer the kernel serialization guarantee. James Jones and Aaron Plattner put together a proposal to add Fences to the X Sync extension. In the X world, those offer a method to serialize rendering between two X applications, but of course the real goal is to expose those fences to GL applications through the various GL sync extensions (including GLARBsync and GLNVfence). With the current Linux DRI implementation, I think it would be pretty easy to implement these fences using pthread semaphores in a block of memory shared between the server and application. That would be DRI-specific; other direct rendering interfaces would use alternate means to share the fences between X server and application. Swap/Present The Second Extension By simply using CopyArea for my application presentation step, I think I ve neatly split this problem into manageable pieces. Once I ve got the DRI3 piece working, I ll move on to fixing the presentation issue. By making that depend solely on existing core Pixmap objects as the source of data to present, I can develop that without any reference to DRI. This will make the extension useful to existing X applications that currently have only CopyArea for this operation. Presentation of application contents occurs in two phases; the first is to identify which objects are involved in the presentation. The second is to perform the presentation operation, either using CopyArea, or by swapping pages or the entire frame buffer. For offscreen objects, these can occur at the same time. For onscreen, the presentation will likely be synchronized with the scanout engine. The second form will mean that the Fences that mark when the presentation has occurred will need to signaled only once the operation completes. A CopyArea operation means that the source pixmap is ready immediately after the Copy has completed. Doing the presentation by using the source pixmap as the new front buffer means that the source pixmap doesn t become ready until after the next swap completes. What I don t know now is whether we ll need to report up-front whether the presentation will involve a copy or a swap. At this point, I don t think so the application will need two back buffers in all cases to avoid blocking between the presentation request and the presentation execution. Yes, it could use a fence for this, but that still sticks a bubble in the 3D hardware where it s blocked waiting for vblank instead of starting on the next frame immediately. Plan of Attack Right now, I m working on finishing up the DRI3 piece:

Replace the DRI2 buffer allocation kludge with actual local buffer allocation, mapping them into pixmaps using FD passing.
Replace the DRI2 authentication scheme with having the X server open the DRI object, preparing it for rendering and passing it back to the application.
Working on the XCB pieces to get the split event-queue stuff landed upstream.
Implementing the Fencing stuff to correctly serialize access to the pixmap.

The first three seem fairly straight forward. The fencing stuff will involve working with James and Aaron to integrate their XSync changes into the server. After that, I ll start working on the presentation piece. Foremost there is figuring out the right name for this new extension; I started with the name Swap as that s the GL call it implements. However, Swap is quite misleading as to the actual functionality; a name more like Present might provide a better indication of what it actually does. Of course, Present is both a verb and a noun, with very different connotations. Suggestions on this most complicated part of the project are welcome!

6 March 2013

Keith Packard: composite-swap

Composite and Swap Getting it Right Where the author tries to make sure DRI3000 is going to do what we want now and in the future DRI3000 The basic DRI3000 plan seems pretty straightforward:

Have applications allocate buffers full of new window contents, attach pixmap IDs to those buffers and pass them to the X server to get them onto the screen.
Provide a mechanism to let applications know when those pixmaps are idle so that they can reuse them instead of creating new ones for every frame.
Finally, allow the actual presentation of the contents to be scheduled for a suitable time in the future, generally synchronized with the monitor. Let the client know when this has happened in case they want to synchronize themselves to vblank.

The DRI3 extension provides a way to associate pixmap IDs and buffers, and given the MIT-SHM prototype I ve already implemented, I think we can safely mark this part as demonstrably implementable. That leaves us with a smaller problem, that of taking pixmap contents and presenting them on the screen at a suitable time and telling applications about the progress of that activity. In the absence of compositing, I m pretty sure the initial Swap extension design would do this job just fine, and should resolve some of the known DRI2 limitations related to buffer management. And, I think that goal is sufficient motivation to go and implement that. However, I wanted to write up some further ideas to see if the DRI3000 plan can be made to do precisely what we want in a composited world. The Composited Goal To make sure we re all on the same page, here s what I expect from the Swap extension in a composited world:

Application calls Swap with new window pixmap
Compositor hears about the new pixmap and uses that to construct a new screen pixmap
Compositor calls Swap with new screen pixmap
Vertical retrace happens, executing the pending swap operation
Compositor hears about the swap completion for the screen
Application hears about the swap completion for its window

In particular, applications should not hear that their swap operations are complete until the contents appear on the screen. This allows for applications to throttle themselves to the screen rate, either doing double or triple buffering as they choose. I didn t add steps here indicating buffers going idle or being allocated, because I think that should all happen behind the scenes from the application s perspective. Many applications won t care about the swap completion notification either, but some will and so that needs to be visible. Redirected Swaps? Owen Taylor suggested that one way of getting the compositor involved would be to have it somehow redirect Swap operations, much like we do with window management operations today. I think that idea may be a good direction to try:

Application calls Swap with new window pixmap
Swap is redirected to compositor, passing along the new window pixmap
Compositor constructs a new screen pixmap using the new window pixmap
Compositor calls Swap on the screen and the window, passing the new screen pixmap and the new window pixmap. When the screen update occurs, the screen and the window both receive swap completion events.

This has the added benefit that the X server knows when the compositor is expecting window pixmaps to change like this the compositor has to explicitly request Swap redirection. Window Pixmap Names and GEM Buffer Handles One issue that swapping window pixmaps around like this brings up is how to manage existing names for the window pixmap. Right now, applications expect that window pixmaps will only change when the window is resized. If the Swap extension is going to actually replace the window pixmap when running with a suitable compositor, then we need to figure out what the old names will reference. Are there non-compositor applications using NameWindowPixmap that matter to us? How about non-compositor applications using TextureFromPixmap to get a GEM handle for a window pixmap? For now, I m very tempted to just break stuff and see who complains, but knowing what we re breaking might be nice beforehand. Idling Pixmaps When an application is done drawing to a window pixmap and has passed it off to the X server for presentation, we d like for that pixmap to be automatically marked as discardable as soon as possible. This way, when memory is tight, the kernel can come steal those pages for something critical. Of course, applications may not want to let the server mark the pixmap as idle after being used, so a flag to the Swap call would be needed. Ideally, the pixmap would become idle immediately after the pixmap contents have been extracted. In the absence of a compositor, that would probably be when the Swap operation completes. With a compositor running, we d need explicit instruction from the compositor telling us that the window pixmap was now idle :

 
    SwapIdle
    drawable: Drawable
    pixmap: Pixmap

Furthermore, the application needs to know that the pixmap is in fact idle. I think that we ll need a synchronous X request that marks a buffer as no longer idle and have that return whether the buffer was discarded while idle. It doesn t seem sufficient to use events here as the application will need to completely reconstruct the pixmap contents in this case. This reply could also contain information about precisely what contents the pixmap does contain.

 
    SwapReuse
    drawable: Drawable
    pixmap: Pixmap
       
    valid: BOOL
    swap-hi: CARD32
    swap-lo: CARD32

Pixmap Lifetimes and Triple Buffered Applications If we redirect the Swap operation and send the original application window pixmap ID to the compositor, what happens when the application frees that pixmap before the compositor gets around to using the contents? Surely the Compositor must handle such cases, and not just crash. However, I m fine with requiring that the application not free the pixmap until told by the compositor.

28 February 2013

Keith Packard: x-on-resize

x-on-resize: a simple display configuration daemon I like things to be automated as much as possible, and having abandoned Gnome to their own fate and switched to xfce, I missed the automatic display reconfiguration stuff. I decided to write something as simple as possible that did just what I needed. I did this a few months ago, and when Carl Worth asked what I was using, I decided to pack it up and make it available. Automatic configuration with a shell script I ve had a shell script around that I used to bind to a key press which I d hit when I plugged or unplugged a monitor. So, all I really need to do is get this script run when something happens. The missing tool here was something to wait for a change to happen and automatically invoke the script I d already written. Resize vs Configure The first version of x-on-resize just listened for ConfigureNotify events on the root window. These get sent every time anything happens with the screen configuration, from hot-plug to notification when someone runs xrandr. That was as simple as possible; the application was a few lines of code to select for ConfigureNotify events, and invoke a program provided on the command line. However, it was a bit too simple as it would also respond to manual invocations of xrandr and call the script then as well. So, as long as I was content to accept whatever the script did, things were fine. And, with a laptop that had a DisplayPort connector for my external desktop monitor, and a separate VGA connector for projectors at conferences, the script always did something useful. Then I got this silly laptop that has only DisplayPort, and for which a dongle is required to get to VGA for projectors. I probably could write something fancy to figure out the difference between a desktop DisplayPort monitor and DisplayPort to VGA dongle, but I decided that solving the simpler problem of only invoking the script on actual hotplug events would be better. So, I left the current invoke-on-resize behavior intact and added new code that watched the list of available outputs and invoked a new config script when that set changed. The final program, x-on-resize, is available via git at

git://people.freedesktop.org/~keithp/x-on-resize

I even wrote a manual page. Enjoy!

20 February 2013

Keith Packard: DRI3000

DRI3000 Even Better Direct Rendering This all started with the presentation that Eric Anholt and I did at the 2012 X developers conference, and subsequently wrote about in my DRI-Next posting. That discussion sketched out the goals of changing the existing DRI2-based direct rendering infrastructure. Last month, I gave a more detailed presentation at Linux.conf.au 2013 (the best free software conference in the world). That presentation was recorded, so you can watch it online. Or, you can read Nathan Willis summary at lwn.net. That presentation contained a lot more details about the specific techniques that will be used to implement the new system, in particular it included some initial indications of what kind of performance benefits the overall system might be able to produce. I sat down today and wrote down an initial protocol definition for two new extensions (because two extensions are always better than one). Together, these are designed to provide complete support for direct rendering APIs like OpenGL and offer a better alternative to DRI2. The DRI3 extension Dave Airlie and Eric Anholt refused to let me call either actual extension DRI3000, so the new direct rendering extension is called DRI3. It uses POSIX file descriptor passing to share kernel objects between the X server and the application. DRI3 is a very small extension in three requests:

Open. Returns a file descriptor for a direct rendering device along with the name of the driver for a particular API (OpenGL, Video, etc).
PixmapFromBuffer. Takes a kernel buffer object (Linux uses DMA-BUF) and creates a pixmap that references it. Any place a Pixmap can be used in the X protocol, you can now talk about a DMA-BUF object. This allows an application to do direct rendering, and then pass a reference to those results directly to the X server.
BufferFromPixmap. This takes an existing pixmap and returns a file descriptor for the underlying kernel buffer object. This is needed for the GL Texture from Pixmap extension.

For OpenGL, the plan is to create all of the buffer objects on the client side, then pass the back buffer to the X server for display on the screen. By creating pixmaps, we avoid needing new object types in the X server and can use existing X apis that take pixmaps for these objects. The Swap extension Once you ve got direct rendered content in a Pixmap, you ll want to display it on the screen. You could simply use CopyArea from the pixmap to a window, but that isn t synchronzied to the vertical retrace signal. And, the semantics of the CopyArea operation precludes us from swapping the underlying buffers around, making it more expensive than strictly necessary. The Swap extension fills those needs. Because the DRI3 extension provides an X pixmap reference to the direct rendered content, the Swap extension doesn t need any new object types for its operation. Instead, it talks strictly about core X objects, using X pixmaps as the source of the new data and X drawables as the destination. The core of the Swap extension is one request SwapRegion. This request moves pixels from a pixmap to a drawable. It uses an X fixes Region object to specify the area of the destination being painted, and an offset within the source pixmap to align the two areas. A bunch of data are included in the reply from the SwapRegion request. First, you get a 64-bit sequence number identifying the swap itself. Then, you get a suggested geometry for the next source pixmap. Using the suggested geometry may result in performance improvements from the techniques described in the LCA talk above. The last bit of data included in the SwapRegion reply is a list of pixmaps which were used as source operands to earlier SwapRegion requests to the same drawable. Each pixmap is listed along with the 64-bit sequence number associated with an earlier SwapRegion operation which resulted in the contents which the pixmap now contains. Ok, so that sounds really confusing. Some examples are probably necessary.

If the SwapRegion operation was implemented by copying data out of the source pixmap into the destination drawable, then the idle swap count will be equal to the swap count from this SwapRegion operation.
If the SwapRegion operation was implemented by swapping the destination contents with the source contents, then the idle swap count will be equal to the previous swap count on the destination drawable.

I m hoping you ll be able to tell that in both cases, the idle swap count tries to name the swap sequence at which time the destination drawable contained the contents currently in the pixmap. Note that even if the SwapRegion is implemented as a Copy operation, the provided source pixmap may not be included in the idle list as the copy may be delayed to meet the synchronization requirements specfied by the client. Finally, if you want to throttle rendering based upon when frames appear on the screen, Swap offers an event that can be delivered to the drawable after the operation actually takes place. Because the Swap extension needs to supply all of the OpenGL SwapBuffers semantics (including a multiplicity of OpenGL extensions related to that), I ve stolen a handful of DRI2 requests to provide the necessary bits for that:

SwapGetMSC
SwapWaitMSC
SwapWaitSBC

These work just like the DRI2 requests of the same names. Current State of the Extensions Both of these extensions have an initial protocol specification written down and stored in git:

8 February 2013

Keith Packard: MicroPeakUSB

MicroPeak USB Interface now available. Altus Metrum is pleased to announce the immediate availability of the MicroPeak USB interface.

Connects the MicroPeak altimeter to USB.
Records 48 seconds of flight
Barometric pressure values recorded about 5 times per second.
Save, graph and export pressure, height, speed and acceleration data.
Supports Linux, Mac OS X and Windows. The software is available on the MicroPeak web page.
Learn more at http://www.altusmetrum.org/MicroPeakUSB
Buy these for $35 at the gag.com web store for Altus Metrum products. Or get one bundled with a MicroPeak and save $10.

MicroPeak USB Interface

MicroPeak and the MicroPeak USB Interface

MicroPeak is fun to use all by itself, providing a quick way to know how high your rocket has flown. But, for those people itching for more data, MicroPeakUSB offers a way to download raw flight data and analyze that on your computer. MicroPeakUSB doesn t require any changes to the MicroPeak hardware new MicroPeak firmware transmits the entire flight log through the on-board LED to a phototransistor on the MicroPeakUSB Interface and then to the USB port on your computer. Existing MicroPeak owners can contact us for a special deal on the MicroPeak USB interface and upgrading the MicroPeak firmware.

30 December 2012

Keith Packard: MicroPeakSerial

MicroPeak Serial Interface Flight Logging for MicroPeak MicroPeak was original designed as a simple peak-recording altimeter. It displays the maximum height of the last flight by blinking out numbers on the LED. Peak recording is fun and easy, but you need a log across apogee to check for unexpected bumps in baro data caused by ejection events. NAR also requires a flight log for altitude records. So, we wondered what could be done with the existing MicroPeak hardware to turn it into a flight logging altimeter. Logging the data The 8-bit ATtiny85 used in MicroPeak has 8kB of flash to store the executable code, but it also has 512B (yes, B as in bytes ) of eeprom storage for configuration data. Unlike the code flash, the little eeprom can be rewritten 100,000 times, so it should last for a lifetime of rocketry. The original MicroPeak firmware already used that to store the average ground pressure and minimum pressure (in Pascals) seen during flight; those are used to compute the maximum height that is shown on the LED. If we store just the two low-order bytes of the pressure data, we d have room left for 251 data points. That means capturing data at least every 32kPa, which is about 3km at sea level. 251 points isn t a whole lot of storage, but we really only need to capture the ascent and arc across apogee, which generally occurs within the first few seconds of flight. MicroPeak samples air pressure once every 96ms, if we record half of those samples, we ll have data every 192ms. 251 samples every 192ms captures 48 seconds of flight. A flight longer than that will just see the first 48 seconds. Of course, if apogee occurs after that limit, MicroPeak will still correctly record that value, it just won t have a continuous log. Downloading the data Having MicroPeak record data to the internal eeprom is pretty easy, but it s not a lot of use if you can t get the data into your computer. However, there aren t a whole lot of interfaces avaialble on MicroPeak. We ve only got:

The 6-pin AVR programming header. This is how we load firmware onto MicroPeak during manufacturing. It s not locked (of course), and the hardware supports reading and writing of flash, ram and eeprom.
The LED. We already use this to display the maximum height of the previous flight, can we blink it faster and then get the computer to read it out?

First implementation I changed the MicroPeak firmware to capture data to eeprom and made a test flight using my calibrated barometric chamber (a large syringe). I was able to read out the flight data using the AVR programming pins and got the flight logging code working that way. The plots I created looked great, but using an AVR programmer to read the data looked daunting for most people as it requires:

An AVR programmer. Adafruit sells the surprisingly useful USBtinyISP programmer. There are Windows drivers available for this, and you can get the necessary avrdude binaries from the usbtiny page above. The programmer itself comes in kit form, so you have to solder it together. There are other programmers available for a bit more than come pre-assembled, but all of them require that you wander around the net finding the necessary drivers and programming software.
A custom MicroPeak programming jig. We have these for sale in the Altus Metrum web store but, because they need special pogo pins and a pile of custom circuit boards, they re not cheap to make.

With the hardware running at least $120 retail, and requiring a pile of software installed from various places around the net, this approach didn t seem like a great way to let people easily capture flight data from their tiny altimeter. The Blinking LED The only other interface available is the MicroPeak LED. It s a nice LED, bright and orange and low power. But, it s still just a single LED. However, it seemed like it might be possible to have it blink out the data and create a device to watch the LED and connect that to a USB port. The simplest idea I had was to just blink out the data in asynchronous serial form; a start bit, 8 data bits and a stop bit. On the host side, I could use a regular FTDI FT230 USB to serial converter chip. Those even have a 3.3V regulator and can supply a bit of current to other components on the board, eliminating the need for an external power supply. To see the LED blink, I needed a photo-transistor that actually responds to the LED s wavelength. Most photo-transistors are designed to work with infrared light, which nicely makes the whole setup invisible. There are a few photo-transistors available which do respond in the visible range, and ROHM RPM-075PT actually has its peak sensitivity right in the same range as the LED. In between the photo-transistor and the FT230, I needed a detector circuit which would send a 1 when the light was present and a 0 when it wasn t. To me, that called for a simple comparator made from an op-amp. Set the voltage on the negative input to somewhere between light and dark and then drive the positive input from the photo-transistor; the output would swing from rail to rail. Bit-banging async The ATtiny85 has only a single serial port , which is used on MicroPeak to talk to the barometric sensor in SPI mode. So, sending data out the LED requires that it be bit-banged directly modulated with the CPU. I wanted the data transmission to go reasonably fast, so I picked a rate of 9600 baud as a target. That means sending one bit every 104 S. As the MicroPeak CPU is clocked at only 250kHz, that leaves only about 26 cycles per bit. I need all of the bits to go at exactly the same speed, so I pack the start bit, 8 data bits and stop bit into a single 16 bit value and then start sending. Of course, every pass around the loop would need to take exactly the same number of cycles, so I carefully avoided any conditional code. With that, 14 of the 26 cycles were required to just get the LED set to the right value. I padded the loop with 12 nops to make up the remaining time. At 26 cycles per bit, it s actually sending data at a bit over 9600 baud, but the FT230 doesn t seem to mind. A bit of output structure I was a bit worried about the serial converter seeing other light as random data, so I prefixed the data transmission with MP ; that made it easy to ignore anything before those two characters as probably noise. Next, I decided to checksum the whole transmission. A simple 16-bit CRC would catch most small errors; it s easy enough to re-try the operation if it fails after all. Finally, instead of sending the data in binary, I displayed each byte as two hex digits, and sent some newlines along to keep the line lengths short. This makes it easy to ship flight logs in email or whatever. Here s a sample of the final data format:

MP
dc880100fec000006800f56d8f63b059
73516447273fa93728301927d91b7712
730bbf0491fe88f7c5ee8ee896e3fadc
9dd9d3d502d1afcea2cbafc6b4c34ec1
bfbfcabf10c03dc05dc070c084c08fc0
9cc0abc0b9c0c1c0ccc0dcc020c152c4
71c9a6cf45d623db7de05ee758edd9f2
b4f9fd00aa074311631a9221c4291330
c035873b2943084bbb52695c0c67eb6b
d26ee5707472fb74a4781f7dee802b84
09860a87e786ad868a866e8659865186
4e8643863e863986368638862e862d86
2f862d86298628862a86268629862686
28862886258625862486
d925

Making the photo-transistor go fast enough The photo-transistor acts as one half of a voltage divider on the positive op-amp terminal, with a resistor making the other half. However, the photo-transistor acts a bit like a capacitor, so when I initially chose a fairly large value for the resistor, it actually took too long to switch between on and off the transistor would spend a bunch of time charging and discharging. I had to reduce the resistor to 1k for the circuit to work. Remaining hardware design I prototyped the circuit on a breadboard using a through-hole op-amp that my daughter designed into her ultrasonic guided robot and a prefabricated FTDI Friend board. I wanted to use the target photo-transistor, so I soldered a couple of short pieces of wire onto the SMT pads and stuck that into the breadboard. Once I had that working, I copied the schematic to gschem, designed a board and had three made at OSHPark for the phenomenal sum of $1.35. Aside from goofing up on the FT230 USB data pins (swapping D+ and D-), the board worked perfectly.

The final hardware design includes an LED connected to the output of the comparator that makes it easier to know when things are lined up correctly, otherwise it will be essentially the same. Host software Our AltosUI code has taught us a lot about delivering code that runs on Linux, Mac OS X and Windows, so I m busy developing something based on the same underlying Java bits to support MicroPeak. Here s a sample of the graph results so far:

Production plans I ve ordered a couple dozen raw boards from OSH Park, and once those are here, I ll build them and make them available for sale in a couple of weeks. The current plan is to charge $35 for the MicroPeak serial interface board, or sell it bundled with MicroPeak for $75.

23 November 2012

Keith Packard: MicroPeak

MicroPeak tiny peak-recording altimeter now available MicroPeak is a miniature peak-recording altimeter. About the same size and weight as a US dime (with battery), MicroPeak offers fabulous accuracy (20cm or 8in at sea level) and wide range (up to 31km or 101k ).

Uses the Measurement Specialties MS5607 barometric sensor.
Includes built-in battery holder for easily replaceable CR1025 lithium battery
Compact design is only 18mm x 14mm or 0.7 x 0.56 . Weighs 1.9g including the battery.
Low power design lasts for over 40 hours in flight.
Auto-poweroff on landing.
Learn more at the Altus Metrum web site
Buy these at the gag.com web store for Altus Metrum products

The size of the board was predicated with the premise that we needed a battery included to avoid having wiring running between the altimeter and the board, we found some small lithium coin-cell battery holders for the CR1025 battery. These battery holders are rated to hold the battery secure up to 150gs. We d already started playing with the Measurement Specialties MS5607 pressure sensor which offers amazing accuracy while using very little power. Taking full-precision measurements every 96ms consumes about .2mA on average. Once on the ground, we stop taking measurements entirely, dropping the power use to around 1 A. It s also pretty small, measuring only 5mm x 3mm. For a CPU, this little project didn t need much. The 8-bit ATtiny85 comes in a 20qwfn package which is only 4mm x 4mm. When run at full speed (8MHz), it consumes a couple of mA of power. Reduce the clock to a pokey 250kHz and the CPU has enough CPU power to track altitude while consuming less than .2mA on average. To avoid losing the battery, we wanted to avoid having it removed while the board wasn t in use. So, we added a little power switch to the board. The one we found is good to at least 50g. Finally, we wanted to find a nice bright LED to show the state of the device and to blink out the final altitude. The OSRAM LO T67K are bright-orange surface-mount LEDs that run happily on 2mA. We used OSHPark.com to create prototype circuit boards for this project. Because of the small size of the board, each prototype run cost only $2 for three boards. It takes a couple of weeks to get boards, but it s really hard to beat the price. All of the schematic and circuit board artwork are published under the TAPR Open Hardware License and are available via git. All of the source code is published under the GPLv2 and is included in the main AltOS source repository.

5 October 2012

Keith Packard: fd-passing

FD passing for DRI.Next Using the DMA-BUF interfaces to pass DRI objects between the client and server, as discussed in my previous blog posting on DRI-Next, requires that we successfully pass file descriptors over the X protocol socket. Rumor has it that this has been tried and found to be difficult, and so I decided to do a bit of experimentation to see how this could be made to work within the existing X implementation. (All of the examples shown here are licensed under the GPL, version 2 and are available from git://keithp.com/git/fdpassing) Basics of FD passing The kernel internals that support FD passing are actually quite simple POSIX already require that two processes be able to share the same underlying reference to a file because of the semantics of the fork(2) call. Adding some ability to share arbitrary file descriptors between two processes then is far more about how you ask the kernel than the actual file descriptor sharing operation. In Linux, file descriptors can be passed through local network sockets. The sender constructs a mystic-looking sendmsg(2) call, placing the file descriptor in the control field of that operation. The kernel pulls the file descriptor out of the control field, allocates a file descriptor in the target process which references the same file object and then sticks the file descriptor in a queue for the receiving process to fetch. The receiver then constructs a matching call to recvmsg that provides a place for the kernel to stick the new file descriptor. A helper API for testing I first write a stand-alone program that created a socketpair, forked and then passed an fd from the parent to the child. Once that was working, I decided that some short helper functions would make further testing a whole lot easier. Here s a function that writes some data and an optional file descriptor:

ssize_t
sock_fd_write(int sock, void *buf, ssize_t buflen, int fd)
 
    ssize_t     size;
    struct msghdr   msg;
    struct iovec    iov;
    union  
        struct cmsghdr  cmsghdr;
        char        control[CMSG_SPACE(sizeof (int))];
      cmsgu;
    struct cmsghdr  *cmsg;
    iov.iov_base = buf;
    iov.iov_len = buflen;
    msg.msg_name = NULL;
    msg.msg_namelen = 0;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    if (fd != -1)  
        msg.msg_control = cmsgu.control;
        msg.msg_controllen = sizeof(cmsgu.control);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_len = CMSG_LEN(sizeof (int));
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        printf ("passing fd %d\n", fd);
        *((int *) CMSG_DATA(cmsg)) = fd;
      else  
        msg.msg_control = NULL;
        msg.msg_controllen = 0;
        printf ("not passing fd\n");
     
    size = sendmsg(sock, &msg, 0);
    if (size < 0)
        perror ("sendmsg");
    return size;

And here s the matching receiver function:

ssize_t
sock_fd_read(int sock, void *buf, ssize_t bufsize, int *fd)
 
    ssize_t     size;
    if (fd)  
        struct msghdr   msg;
        struct iovec    iov;
        union  
            struct cmsghdr  cmsghdr;
            char        control[CMSG_SPACE(sizeof (int))];
          cmsgu;
        struct cmsghdr  *cmsg;
        iov.iov_base = buf;
        iov.iov_len = bufsize;
        msg.msg_name = NULL;
        msg.msg_namelen = 0;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsgu.control;
        msg.msg_controllen = sizeof(cmsgu.control);
        size = recvmsg (sock, &msg, 0);
        if (size < 0)  
            perror ("recvmsg");
            exit(1);
         
        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg && cmsg->cmsg_len == CMSG_LEN(sizeof(int)))  
            if (cmsg->cmsg_level != SOL_SOCKET)  
                fprintf (stderr, "invalid cmsg_level %d\n",
                     cmsg->cmsg_level);
                exit(1);
             
            if (cmsg->cmsg_type != SCM_RIGHTS)  
                fprintf (stderr, "invalid cmsg_type %d\n",
                     cmsg->cmsg_type);
                exit(1);
             
            *fd = *((int *) CMSG_DATA(cmsg));
            printf ("received fd %d\n", *fd);
          else
            *fd = -1;
      else  
        size = read (sock, buf, bufsize);
        if (size < 0)  
            perror("read");
            exit(1);
         
     
    return size;

With these two functions, I rewrote the simple example as follows:

void
child(int sock)
 
    int fd;
    char    buf[16];
    ssize_t size;
    sleep(1);
    for (;;)  
        size = sock_fd_read(sock, buf, sizeof(buf), &fd);
        if (size <= 0)
            break;
        printf ("read %d\n", size);
        if (fd != -1)  
            write(fd, "hello, world\n", 13);
            close(fd);
         
     
 
void
parent(int sock)
 
    ssize_t size;
    int i;
    int fd;
    fd = 1;
    size = sock_fd_write(sock, "1", 1, 1);
    printf ("wrote %d\n", size);
 
int
main(int argc, char **argv)
 
    int sv[2];
    int pid;
    if (socketpair(AF_LOCAL, SOCK_STREAM, 0, sv) < 0)  
        perror("socketpair");
        exit(1);
     
    switch ((pid = fork()))  
    case 0:
        close(sv[0]);
        child(sv[1]);
        break;
    case -1:
        perror("fork");
        exit(1);
    default:
        close(sv[1]);
        parent(sv[0]);
        break;
     
    return 0;

Experimenting with multiple writes I wanted to know what would happen if multiple writes were made, some with file descriptors and some without. So I changed the simple example parent function to look like:

void
parent(int sock)
 
    ssize_t size;
    int i;
    int fd;
    fd = 1;
    size = sock_fd_write(sock, "1", 1, -1);
    printf ("wrote %d without fd\n", size);
    size = sock_fd_write(sock, "1", 1, 1);
    printf ("wrote %d with fd\n", size);
    size = sock_fd_write(sock, "1", 1, -1);
    printf ("wrote %d without fd\n", size);

When run, this demonstrates that the reader gets two bytes in the first read along with a file descriptor followed by one byte in a second read, without a file descriptor. This demonstrates that a file descriptor message forms a barrier within the socket; multiple messages will be merged together, but not past a message containing a file descriptor. Reading without accepting a file descriptor What happens when the reader isn t expecting a file descriptor? Does it just get lost? Does the reader not get the message until it asks for the file descriptor? What about the boundary issue described above? Here s my test case:

void
child(int sock)
 
    int fd;
    char    buf[16];
    ssize_t size;
    sleep(1);
    size = sock_fd_read(sock, buf, sizeof(buf), NULL);
    if (size <= 0)
        return;
    printf ("read %d\n", size);
    size = sock_fd_read(sock, buf, sizeof(buf), &fd);
    if (size <= 0)
        return;
    printf ("read %d\n", size);
    if (fd != -1)  
        write(fd, "hello, world\n", 13);
        close(fd);
     
 
void
parent(int sock)
 
    ssize_t size;
    int i;
    int fd;
    fd = 1;
    size = sock_fd_write(sock, "1", 1, 1);
    printf ("wrote %d without fd\n", size);
    size = sock_fd_write(sock, "1", 1, 2);
    printf ("wrote %d with fd\n", size);

This shows that the first passed file descriptor is picked up by the first sockfdread call, but the file descriptor is closed. The second file descriptor passed is picked up by the second sockfdread call. Zero-length writes Can a file descriptor be passed without sending any data?

void
parent(int sock)
 
    ssize_t size;
    int i;
    int fd;
    fd = 1;
    size = sock_fd_write(sock, "1", 1, -1);
    printf ("wrote %d without fd\n", size);
    size = sock_fd_write(sock, NULL, 0, 1);
    printf ("wrote %d with fd\n", size);
    size = sock_fd_write(sock, "1", 1, -1);
    printf ("wrote %d without fd\n", size);

And the answer is clearly no the file descriptor is not passed when no data are included in the write. A summary of results

read and recvmsg don t merge data across a file descriptor message boundary.
failing to accept an fd in the receiver results in the fd being closed by the kernel.
a file descriptor must be accompanied by some data.

Make X pass file descriptors I d like to get X to pass a file descriptor without completely rewriting the internals of both the library and the X server. Ideally, without making any changes to the existing code paths for regular request processing at all. On the sending side, this seems pretty straightforward we just need to get the X connection file descriptor and call sendmsg directly, passing the desired file descriptor along. In XCB, this could be done by using the xcbtakesocket interface to temporarily hijack the protocol as Xlib does. It s the receiving side where things are messier. Because a bare read will discard any delivered file descriptor, we must make sure to use recvmsg whenever we want to actually capture the file descriptor. Kludge X server fd receiving Because a passed fd creates a barrier in the bytestream, when the X server reads requests from a client, the read will stop sending data after the message with the file descriptor is consumed. Of course, this process consumes the passed file descriptor, and if that call isn t made with recvmsg set up to receive it, the fd will be lost. As a simple kludge, if we pass a meaningless fd with the X request and then the real fd with a following XNoOperation request, the existing request reading code will get the request, discard the meaningless fd and then stop reading at that point due to the barrier. Once into the request processing code, recvmsg can be called to get the real file descriptor and the associated XNoOperation request. I wrote a test for this that demonstrates how this works:

static void
child(int sock)
 
    uint8_t xreq[1024];
    uint8_t xnop[4];
    uint8_t req;
    int i, reqlen;
    ssize_t size, fdsize;
    int fd = -1, *fdp;
    int j;
    sleep (1);
    for (j = 0;; j++)  
        size = sock_fd_read(sock, xreq, sizeof (xreq), NULL);
        printf ("got %d\n", size);
        if (size == 0)
            break;
        i = 0;
        while (i < size)  
            req = xreq[i];
            reqlen = xreq[i+1];
            i += reqlen;
            switch (req)  
            case 0:
                break;
            case 1:
                if (i != size)  
                    fprintf (stderr, "Got fd req, but not at end of input %d < %d\n",
                         i, size);
                 
                fdsize = sock_fd_read(sock, xnop, sizeof (xnop), &fd);
                if (fd == -1)  
                    fprintf (stderr, "no fd received\n");
                  else  
                    FILE    *f = fdopen (fd, "w");
                    fprintf(f, "hello %d\n", j);
                    fflush(f);
                    fclose(f);
                    close(fd);
                    fd = -1;
                 
                break;
            case 2:
                fprintf (stderr, "Unexpected FD passing req\n");
                break;
             
         
     
 
int
tmp_file(int j)  
    char    name[64];
    sprintf (name, "tmp-file-%d", j);
    return creat(name, 0666);
 
static void
parent(int sock)
 
    uint8_t xreq[32];
    uint8_t xnop[4];
    int i, j;
    int fd;
    for (j = 0; j < 4; j++)  
        /* Write a bunch of regular requests */
        for (i = 0; i < 8; i++)  
            xreq[0] = 0;
            xreq[1] = sizeof (xreq);
            sock_fd_write(sock, xreq, sizeof (xreq), -1);
         
        /* Write our 'pass an fd' request with a 'useless' FD to block the receiver */
        xreq[0] = 1;
        xreq[1] = sizeof(xreq);
        sock_fd_write(sock, xreq, sizeof (xreq), 1);
        /* Pass an fd */
        xnop[0] = 2;
        xnop[1] = sizeof (xnop);
        fd = tmp_file(j);
        sock_fd_write(sock, xnop, sizeof (xnop), fd);
        close(fd);

Fixing XCB to receive file descriptors Multiple threads may be trying to get replies and events back from the X server at the same time, which means the kludge of having the real fd follow the message will likely lead to the wrong thread getting the file descriptor. Instead, I suspect the best plan will be to fix XCB to internally capture passed file descriptors and save them with the associated reply. Because the file descriptor message will form a barrier in the read stream, xcb can associate any received file descriptor with the last reply in the read data. The X server would then send the reply with an explicit sendmsg call to pass both reply and file descriptor together. Next steps The next thing to do is code up a simple fd passing extension and try to get it working, passing descriptors back and forth to the X server. Once that works, design of the rest of the DRM-Next extension should be pretty straightforward.

28 September 2012

Keith Packard: DRI-Next

Thoughts about DRI.Next On the way to the X Developer s Conference in Nuremberg, Eric and I chatted about how the DRI2 extension wasn t really doing what we wanted. We came up with some fairly rough ideas and even held an informal presentation about it. We didn t have slides that day, having come up with the content for the presentation in the hours just before the conference started. This article is my attempt to capture both that discussion and further conversations held over roast pork dinners that week. A brief overview of DRI2 Here s a list of the three things that DRI2 currently offers. Application authentication. The current kernel DRM authentication mechanism restricts access to the GPU to applications connected to the DRM master. DRI2 implements this by having the application request the DRM cookie from the X server which can then be passed to the kernel to gain access to the device. This is fairly important because once given access to the GPU, an application can access any flink d global buffers in the system. Given that the application sends screen data to the X server using flink d buffers, that means all screen data is visible to any GPU-accessing application. This bypasses any GPU hardware access controls. Allocating buffers. DRI2 defines a set of attachment points for buffers which can be associated with an X drawable. An application needing a specific set of buffers for a particular rendering operation makes a request of the X server which allocates the buffers and passes back their flink names. The server automatically allocates new buffers when window sizes change, sending an event to the application so that it knows to request the new buffers at some point in the future. Presenting data to the user. The original DRI2 protocol defined only the DRI2CopyRegion request which copied data between the allocated buffers. SwapBuffers was implemented by simply copy data from the back buffer to the front buffer. This didn t provide any explicit control over frame synchronization, so a new request, DRI2SwapBuffers, was added to expose controls for that. This new request only deals with the front and back buffers, and either copies from back to front or exchanges those two buffers. Along with DRI2SwapBuffers, there are new requests that wait for various frame counters and expose those to GL applications through the OMLsynccontrol extension What s wrong with DRI2? DRI2 fixed a lot of the problems present with the original DRI extension, and made reliable 3D graphics on the Linux desktop possible. However, in the four years since it was designed, we ve learned a lot, and the graphics environment has become more complex. Here s a short list of some DRI2 issues that we d like to see fixed.

InvalidateBuffers events. When the X window size changes, the buffers created by the X server for rendering must change size to match. The problem is that the client is presumably drawing to the old buffers when the new ones are allocated. Delivering an event to the client is supposed to make it possible for the client to keep up, but the reality is that the event is delivered at some random time to some random thread within the application. This leads to general confusion within the application, and often results in a damaged frame on the screen. Fortunately, applications tend to draw their contents often, so the damaged frame only appears briefly.
No information about new back buffer contents. When a buffer swap happens and the client learns about the new back buffer, the back buffer contents are always undefined. For most applications, this isn t a big deal as they re going to draw the whole window. However, compositing managers really want to reduce rendering by only repainting small damaged areas of the window. Knowing what previous frame contents are present in the back buffer allows the compositing manager to repaint just the affected area.
Un-purgable stale buffers. Between the X server finishing with a buffer and the client picking it up for a future frame, we don t need to save the buffer contents and should mark the buffer as purgable. With the current DRI2 protocols, this can t be done, which leaves all of those buffers hanging around in memory.
Driver-specific buffers. The DRI2 buffer handles are device specific, and so we can t use buffers from other devices on the screen. External video encoders/cameras/encoders can t be used with the DRI2 extension.
GEM flink has lots of issues. The flink names are global, allowing anyone with access to the device to access the flink data contents. There is also no reference to the underlying object, so the X server and client must carefully hold references to GEM objects during various operations.

Proposed changes for DRI.Next Given the three basic DRI2 operations (authentication, allocation, presentation), how can those be improved? Eliminate DRI/DRM magic-cookie based authentication Kristian H gsberg, Martin Peres, Timoth e Ravier & Daniel Vetter gave a talk on DRM2 authentication at XDC this year that outlined the problems with the current DRM access control model and proposed some fairly simple solutions, including using separate device nodes one for access to the GPU execution environment and a separate, more tightly controlled one, for access to the display engine. Combining that with the elimination of flink for communicating data between applications and there isn t a need for the current magic-cookie based authentication mechanism; simple file permissions should suffice to control access to the GPU. Of course, this ignores the whole memory protection issue when running on a GPU that doesn t provide access control, but we already have that problem today, and this doesn t change that, other than to eliminate the global uncontrolled flink namespace. Allocate all buffers in the application DRI2 does buffer allocation in the X server. This ensures that that multiple (presumably cooperating) applications drawing to the same window will see the same buffers, as is required by the GLX extension. We suspected that this wasn t all that necessary, and it turns out to have been broken several years ago. This is the traditional way in X to phase out undesirable code, and provides an excellent opportunity to revisit the original design. Doing buffer allocations within the client has several benefits:

No longer need DRI2 additions to manage new GL buffers. Adding HiZ to the intel driver required new DRI2 code in the X server, even though X wasn t doing anything with those buffers at all.
Eliminate some X round trips currently required for GL buffer allocation.
Knowing what s in each buffer. Because the client allocates each buffer, it can track the contents of them.
Size tracking is trivial. The application sends the GL the of the viewport, and the union of all viewports should be the same as the size of the window (or there will be undefined contents on the screen). The driver can use the viewport information to size the buffers and ensure that every frame on the screen is complete.

Present buffers through DMA-buf The new DMA-buf infrastructure provides a cross-driver/cross-process mechanism for sharing blobs of data. DMA-buf provides a way to take a chunk of memory used by one driver and pass it to another. It also allows applications to create file descriptors that reference these objects. For our purposes, it s the file descriptor which is immediately useful. This provides a reliable and secure way to pass a reference from an underlying graphics buffer from the client to the X server by sending the file descriptor over the local X socket. An additional benefit is that we get automatic integration of data from other devices in the system, like video decoders or non-primary GPUs. The Prime support added in DRI version 2.8 hacks around this by sticking a driver identifier in the driverType value. Once the buffer is available to the X server, we can create a request much like the current DRI2SwapBuffers request, except instead of implicitly naming the back and front buffers, we can pass an arbitrary buffer and have those contents copied or swapped to the drawable. We also need a way to copy a region into the drawable. I don t know if that needs the same level of swap control, but it seems like it would be nice. Perhaps the new SwapBuffers request could take a region and offset as well, copying data when swapping isn t possible. Managing buffer allocations One trivial way to use this new buffer allocation mechanism would be to have applications allocate a buffer, pass it to the X server and then simply drop their reference to it. The X server would keep a reference until the buffer was no longer in use, at which point the buffer memory would be reclaimed. However, this would eliminate a key optimization in current drivers the ability to re-use buffers instead of freeing and allocating new ones. Re-using buffers takes advantage of the work necessary to setup the buffer, including constructing page tables, allocating GPU memory space and flushing caches. Notifying the application of idle buffers Once the X server is finished using a buffer, it needs to notify the application so that the buffer can be re-used. We could send these notifications in X events, but that ends up in the twisty mess of X client event handling which has already caused so much pain with Invalidate events. The obvious alternative is to send them back in a reply. That nicely controls where the data are delivered, but causes the application to block waiting for the X server to send the reply. Fortunately, applications already want to block when swapping buffers so that they get throttled to the swap buffers rate. That is currently done by having them wait for the DRI2SwapBuffers reply. This provides a nice place to stick the idle buffer data. We can simply list buffers which have become idle since the last SwapBuffers reply was delivered. Releasing buffer memory Applications which update only infrequently end up with a back buffer allocated after their last frame which can t be freed by the system. The fix for this is to mark the buffer purgable, but that can only be done after all users of the buffer are finished with it. With this new buffer management model, the application effectively passes ownership of its buffers to the X server, and the X server knows when all use of the buffer are finished. It could mark buffers as purgable at that point. When the buffer was sent back in the SwapBuffers reply, the application would be able to ask the kernel to mark it un-purgable again. A new extension? Or just a new DRI2 version? If we eliminate the authentication model and replace the buffer allocation and presentation interfaces, what of the existing DRI2 protocol remains useful? The only remaining bits are the other synchronization requests: DRI2GetMSC, DRI2WaitMSC, DRI2WaitSBC and DRI2SwapInterval. Given this, does it make more sense to leave DRI2 as it is and plan on deprecating, and eventually eliminating, it? Doing so would place a support burden on existing applications, as they d need to have code to use the right extension for the common requests. They ll already need to support two separate buffer management versions though, so perhaps this burden isn t that onerous?

13 September 2012

Keith Packard: Altos1.1

AltOS 1.1 Bug fixes and some nice new features Bdale and I are pleased to announce the release of AltOS version 1.1. AltOS is the core of the software for all of the Altus Metrum products. It consists of cc1111-based micro-controller firmware and Java-based ground station software. We ve spent the last flying season chatting with people flying TeleMetrum and TeleMini boards and they came up with some great ideas to add to the system. AltOS Firmware Features and fixes There are bug fixes in both ground station and flight software, so you should plan on re-flashing both units at some point. However, there aren t any incompatible changes, so you don t have to do it all at once. New features:

Apogee-lockout timer. For situations where the normal apogee determination algorithm could be fooled, we ve added a timeout value to prevent premature firing of the apogee charge. Normal flights won t need this, but a couple of users asked for this feature.
RSSI value for Monitor Idle mode. The TeleDongle firmware has been updated to report signal strength information for data received from the altimeter. This allows the user to see how well the radio is working without having to switch to flight mode.
Force the radio to 434.550MHz. This is useful with TeleMini devices where the only way to talk to the device is through the radio. If you don t know the frequency, it s really hard to make that work.

Bug fixes:

Stale telemetry data reported when switching frequencies. TeleDongle would accidentally re-transmit old telemetry packets when the radio frequency was changed. This would be harmless except that when scanning to find the frequency used by an altimeter, you d appear to get packets at every frequency.

AltosUI Easier to use AltosUI has also seen quite a bit of work for the 1.1 release. There aren t any huge new features, but some activities are restructured to make them easier to navigate. And, of course, we ve fixed a bunch of bugs. New features:

Configure Ground Station activity. This provides a way to set the default radio frequency for each TeleDongle. This replicates the frequency menu present in the Monitor Flight activity, but doesn t also bring up that giant window.
Support the apogee lockout timer. This just adds another entry in the dialog for configuring the altimeter to configure the new timer. By default, the timer is disabled, allowing the apogee detection code in the flight computer to operate normally.
Add imperial units option. When enabled, this uses imperial units (feet and miles) for all values on the screen and in the voice announcements.

User interface changes:

Make the look-n-feel configurable. Java offers many different user interface styles on each platform. This exposes the available set and lets the user pick one. By default, we continue to use the native platform appearance.
Add an Age element to the Monitor Flight UI. This shows how long it has been since the last valid telemetry packet was received, making it easy to tell when communications are lost.
Change flight data downloading. Instead of selecting which to download and which to delete at the same time, the interface now has separate steps for downloading and then deleting files. This makes it easier to verify that the files were downloaded before deleting flights from the on-board memory.
Re-compute boost and landing times. Given the whole flight history, it s easy to find the time when the rocket started and stopped flying. Having these get recomputed means the boost time, acceleration values and main descent rates are computed more accurately.

Bug fixes:

Wait for 10 valid GPS messages before marking GPS as ready. Before this fix, GPS was marked as ready when 10 valid packets were received from the flight computer after the first valid GPS data arrived. This waits for 10 valid GPS packets instead.
Fix Google Earth file export. The format requirements for Google Earth files became more strict in recent releases; this patch changes how the files are formatted to make them work again.
Make AltosUI run on Mac OS X Lion . Apple changed the default heap size for Java applications with this release, dramatically reducing the memory available to applications. This would cause map tiles to fail to load and other random problems.
Improve COM port handling on Windows. This eliminates the need to wait 5 seconds between closing and re-opening devices, and also eliminates other spurious errors when opening devices.

Next.

Previous.